1.  Introduction


Visual processing has predominantly been aimed at labeled, static images (e.g., Caltech101),
ignoring a) moving images, which constitute a vast amount of visual data (e.g., YouTube,
television, as well as all natural real-world visual experience); and b) unlabeled images,
despite the fact that labeling is among the most time-intensive aspects of vision research. We
studied 1) the development of tasks for visual processing of moving scenes, to provide the
field with datasets and benchmarks that begin to catch up with the very large number of
static visual datasets; and 2) the development and testing of algorithms for vision for
time-varying images (VTV), including evaluation of existing algorithms and development of novel
approaches. This grant was intended to be a relatively brief (18-month) initial
proof-of-principle effort. It has arguably exceeded its initial aims: we have developed novel
algorithms for object recognition and localization in both still images and videos, and we
have carried out initial evaluations comparing the new methods with previous approaches.
The results, described herein, are promising, and ongoing work is aimed at extending the
initial findings to include a suite of advanced approaches to VTV tasks.


2.  Novel  supervised  learning  system 

The results of this work have been based on algorithms developed from brain circuit
analysis, described in part in a series of publications (Rodriguez et al., 2004; Granger 2005;
2006; Felch & Granger 2008; 2011; Granger 2012). In short, multiple regions of the brain
perform individual algorithms in isolation, and their combined operation yields a system that
takes inputs, constructs memory hierarchies incrementally via learning, and produces
suggested output responses. The internal representations are in the form of nested
sequences of categories, corresponding to invariant spatiotemporal patterns; these have
been analyzed in terms of families of grammars that encode relations organized
hierarchically (Granger 2006; 2012).

The cortico-striatal loop (CSL) system is one instance of a method that emerges from the
interaction of two distinct simpler algorithms (both derived from brain circuit operation):
one that performs unsupervised clustering, and another that performs match-mismatch
signaling. The combined system operates in unsupervised mode, except when
presented with information that can be used for reinforcement. For instance, complex
patterns (e.g., objects with various shapes) may be initially learned via unsupervised
relations among their component parts. Whenever these unsupervised representations are
found to be at odds with (sparse) supervised information (e.g., when an input is categorized
incorrectly), the condition triggers a further unsupervised split of the node in the tree. This
successive subdivision repeats until a correct supervised classification is arrived at
(Chandrashekar & Granger 2012). The result integrates rich unsupervised representations,
obtained via extremely inexpensive methods, with sparse reinforcement signals used when they
are available.

The CSL algorithm is a generative method, i.e., it is in the category of algorithms that model
data occurring within each presented class, rather than "discriminative" methods, which
seek solely to identify differences between classes. Generative models are often taken as
performing extra work compared to discriminative models, especially in cases where the
only task is to distinguish among labeled classes (Ng & Jordan 2002). The CSL method thus
carries out more work than typical classification methods such as support vector machines
(SVMs). Experiments have nonetheless been run to compare the algorithms against each other on
classification tasks, with surprising results. The classification task is, for the CSL algorithm,
a restricted task, since the algorithm is capable of many additional operations (including
unsupervised learning, localization, and others); yet this restricted task is among the most
widely used applications in image processing. In this task, the CSL algorithm achieves
classification results comparable to those of SVMs at far lower computational cost,
despite carrying out the additional work entailed in generative learning (Chandrashekar
& Granger 2012).

The CSL mechanism identifies supervised class boundaries as a side effect of its primary
operation, which is that of uncovering structure in the input space independent of
supervised labels. It performs solely unsupervised splits of the data into similarity-based
clusters. The algorithm, described in detail in Chandrashekar & Granger (2012), is as shown
below.

The method constructs a class tree that records unsupervised structure within the data as
well as providing a means to perform class prediction on novel samples, as per supervised
learning tasks. The PARTITION function denotes an unsupervised clustering algorithm
which can in principle be any of a family of clustering routines. The function SUBDIVIDE
determines whether or not the data at a given tree node qn all belong to a single labeled
class; if not, the function iterates to further subdivide the node.

This deceptively simple mechanism not only produces a supervised classifier, but also
uncovers the similarity structure embedded in the dataset, which competing supervised
methods such as SVMs do not do. Despite the fact that competing algorithms, including SVM
and k-NN methods, were designed expressly to obtain maximum accuracy at supervised
classification, we have presented findings indicating that even on this task, the CSL
algorithm achieves comparable accuracy while requiring significantly less computational
resource cost. This work is described in detail in Chandrashekar & Granger (2012).

Input:  Dataset: X = {x_i ∈ R^M} with labels Y = {y_i ∈ {1, 2, 3, ..., K}}
Output: Class Tree: a tree rooted at the node TRoot

Init: TRoot.X = X; TRoot.Y = Y; TRoot.Labels = LABELSET(Y)
Q = []; Add(Q, TRoot)
while Q is not empty do
    qn = first node in Q (removed from Q)
    if SUBDIVIDE(X_qn, Y_qn) = true then
        [Centroids, Clusters] = PARTITION(X_qn, K*)
        foreach Cluster C_k do
            Node T
            T.X = Clusters[k]
            T.Labels = LABELSET(Y(T.X))
            qn.Branches[k] = Centroids[k]
            qn.Children[k] = T
            Add(Q, T)
        end
    end
end

Figure 1: CSL Learning Algorithm
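
The class-tree construction can be illustrated with a minimal Python sketch. This is not the implementation of Chandrashekar & Granger (2012): plain k-means is assumed as the PARTITION routine, SUBDIVIDE is assumed to test only label purity, and the names `kmeans`, `Node`, `build_class_tree`, and `predict` are hypothetical.

```python
import random
from collections import Counter, deque

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means, standing in for PARTITION (an assumption of this sketch)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]

    def nearest(p):
        return min(range(k), key=lambda c: sq_dist(p, centroids[c]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, [nearest(p) for p in points]

class Node:
    def __init__(self, X, Y):
        self.X, self.Y = X, Y
        self.labels = set(Y)   # LABELSET(Y)
        self.branches = []     # one centroid per child
        self.children = []

def build_class_tree(X, Y, k=2):
    root = Node(X, Y)
    queue = deque([root])
    while queue:
        qn = queue.popleft()
        # SUBDIVIDE: split only nodes that still mix several labels.
        if len(qn.labels) <= 1 or len(qn.X) < k:
            continue
        centroids, assign = kmeans(qn.X, k)
        groups = [[i for i, a in enumerate(assign) if a == c] for c in range(k)]
        if sum(1 for g in groups if g) < 2:
            continue  # the split made no progress; leave this node as a leaf
        for c, idx in enumerate(groups):
            if idx:
                child = Node([qn.X[i] for i in idx], [qn.Y[i] for i in idx])
                qn.branches.append(centroids[c])
                qn.children.append(child)
                queue.append(child)
    return root

def predict(root, x):
    """Follow the nearest centroid down to a leaf; return its majority label."""
    node = root
    while node.children:
        j = min(range(len(node.children)), key=lambda c: sq_dist(x, node.branches[c]))
        node = node.children[j]
    return Counter(node.Y).most_common(1)[0][0]
```

Labels are consulted only to decide whether a node needs another split; the splits themselves are purely unsupervised, which is the property the text emphasizes.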


3.  Joint  localization  and  clustering 

Another derived method, JLC, is a generative model that simultaneously identifies the objects
in a set of image data and identifies the locus of those objects within the images. The
algorithm searches a (preferably very large) dataset and clusters together images containing
similar neighboring feature groups, thus identifying the occurrence of similar-appearing
regions across the images. The method learns the feature histograms (using just a simple
bag-of-features representation) in tandem with the region of the image that contains that
feature set; the corresponding region is designated the "foreground" for that object for that
image. (Foregrounds can be represented in either of two ways: as bounding boxes or as
"superpixels"; the latter consist of bottom-up unsupervised segments within the
scene.)

The method completely eliminates the need for labeling of images. Labeling is arguably one of
the most time-consuming and expensive components of image processing. The intuition
behind the approach is that objects can be viewed as recurring foreground patterns
appearing as coherent image regions. This approach has been used in several other studies
such as semantic latent topic models for image clustering (Russell et al., 2006; Fritz & Schiele
2008).

The method is a generative model of "foreground" formation that enables simultaneous
image clustering and efficient foreground localization via maximum likelihood estimation.
We formulate object discovery as the task of partitioning an unlabeled collection of images
into K subsets (clusters) such that all images within each subset share a similar foreground.
In order to obtain a method scalable to large collections and many classes, we adopt a
foreground mask-based representation of objects, which enables fast localization given the
object model. We do not commit to any particular bottom-up segmentation model. Instead
we treat the foreground mask as a parameter to be estimated as part of the likelihood
optimization. We demonstrate that this leads to localization and image clustering that
outperforms competing approaches (Chandrashekar et al., 2012). We view each object
instance as a random variable drawn from an unknown distribution common to all
instances of that object class. This common distribution assumption constrains all
subwindow histograms of an object class to represent subtle variations around a
prototypical average histogram. Based on this assumption, our approach poses object
discovery as a maximum likelihood estimation problem to be optimized over the collection
of unlabeled images. We have presented a method that maximizes this objective by
simultaneously solving for the histogram model parameters of the object classes, detecting
the object instances of each class in the unlabeled images, and performing a soft semantic
clustering of images in the dataset.

Approved for public release; distribution unlimited.


4.  Object  discovery  in  videos 

The work on joint localization and clustering operates on static images, yet most visual input
is time-varying, whether from movies, TV, videos, surveillance, or simply everyday
visual experience. Video data actually adds useful constraints to the object recognition task,
via inherent temporal consistency across neighboring frames, measurable via a range of
optic flow methods.

The video work extends our generative model of static object formation. The method
clusters together videos that contain similar objects; here we combine an appearance
model and a local optic-flow-based Markov model into a single objective function
defined over the video collection. Since the Markov model is local to a given video, the same
object class can appear in different movement patterns and yet still contribute to the
development of the object class model.

Learning objects from videos has traditionally been attempted in the form of fully supervised
methods, relying on structure from motion or propagating belief by tracking
during testing. Indeed, if the task is fully supervised and all video frames are fully
annotated, no special methods are required, as each frame is itself a still image. As
mentioned, however, generating labels is time-consuming and requires expert human
intervention; the problem only increases when faced with the multiple frames per second
that occur in videos.


5.  Related work

Supervised methods that learn to recognize and segment objects in videos include methods
that rely on structure from motion (Brostow et al., 2008; Ladický et al., 2010). These methods
make assumptions about characteristics of the environment, and even camera angle, that do
not readily generalize to real-world datasets with unknown camera motion, lighting changes,
and poor or variable resolution; moreover, they suffer from the laborious necessity of
human hand-labeling. Typical supervised methods for video segmentation are
interactive (Bai et al., 2009; Price et al., 2009), requiring input from users that is again
time-consuming and may require some expertise.

Unsupervised video segmentation methods include motion segmentation (e.g., Malik & Shi
1998), which clusters pixels in video using bottom-up motion cues; these purely bottom-up
methods are highly susceptible to variable camera motion and lighting changes, and are
unreliable in certain object motion settings (e.g., when the object starts and stops). Other
methods require tracking regions or "keypoints" across frames (Brendel & Todorovic 2009;
Brox & Malik 2010; Vazquez-Reina et al., 2010), or formulate clustering objectives to group
pixels from all frames using appearance and motion cues (Huang et al., 2009; Grundmann et
al., 2010). (A model that overcomes some of these drawbacks (Lee, Kim & Grauman, 2011)
is set up as a pipeline of arbitrary stages of processing, and its properties have been difficult
to characterize.) None of these methods learns a foreground appearance model, i.e., a
way of generatively characterizing the learned visual objects.

Many approaches have used spatiotemporal feature matching to process video datasets,
particularly for gesture recognition (e.g., Dollar et al., 2005; Laptev 2005; Niebles et al.,
2008; Willems et al., 2008). It is important to recognize that these methods typically do not
generalize to object recognition, in the not-unusual case where there are irregular motions
in a video (e.g., an object moving at uneven speed, or stopping and starting). Furthermore,
learned spatiotemporal models typically cannot be used to recognize still images, since
movement is integrally represented in the learned model.

In  contrast,  we  have  taken  an  approach  of  simultaneous  clustering  and  localization  of 
objects  in  unlabeled  videos  via  optimization  of  a  single  objective.  The  method  uses  both  an 
appearance  model  and  motion  model  as  constraints  in  the  search  for  object  foregrounds  in 
the  videos.  The  learned  appearance  model  operates  on  still  images  as  well  as  on  the  videos 
from  which  it  was  acquired. 


6.  Generative  model  for  unsupervised  object  discovery  in  videos 

The figure below illustrates the generative video processing model. We are given a set of N
unlabeled videos $z_1, \ldots, z_N$, with each video assumed to contain one of K objects in all of
its frames. Let the foreground object content in frame j of video i be described by $z_i^j$. The
twofold objective is to separate the videos into K disjoint subsets (clusters) corresponding to
the K object classes, and to localize the object within every frame of each video.


[Figure 2: Video Processing Model. Diagram labels: foreground; observed video frame;
foreground mask; mixing coefficients; cluster label; background parameters.]


Let $x_i^j$ denote the (unknown) foreground mask enclosing the object of $z_i^j$; the
foreground mask for the entire video $i$ is then $x_i$, a sequence of random variables. The
foreground region corresponding to the mask for the frame is computed as the
unnormalized histogram

$$h(z_i^j, x_i^j) \in \mathbb{N}^S$$

of the visual words (quantized local visual features) that occur inside $x_i^j$ ($S$ represents the
number of unique words in the visual codebook which, as usual, is learned from training
images during an offline prior stage; more about this will be discussed later in
"developmental learning"). The foreground content for the overall video is computed as the
average of the content of the foreground regions in all its frames, i.e.,

$$H(z_i, x_i) = \frac{1}{|z_i|} \sum_{j=1}^{|z_i|} h(z_i^j, x_i^j)$$

where $|z_i|$ is the number of frames in $z_i$.
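
As a concrete illustration of the two histograms, the following Python sketch counts visual words inside a bounding-box mask for each frame and averages the counts over the frames of a video. The data layout (a frame as a list of `(x, y, word_id)` tuples, a mask as a box `(x0, y0, x1, y1)`) and the function names are assumptions made for this example only.

```python
from collections import Counter

def frame_histogram(words, box, vocab_size):
    """Unnormalized histogram h(z_i^j, x_i^j) of the visual words inside the mask.

    words: iterable of (x, y, word_id) for one frame (hypothetical layout).
    box:   (x0, y0, x1, y1), inclusive of x0/y0 and exclusive of x1/y1.
    """
    x0, y0, x1, y1 = box
    counts = Counter(w for (x, y, w) in words if x0 <= x < x1 and y0 <= y < y1)
    return [counts.get(s, 0) for s in range(vocab_size)]

def video_histogram(frames, boxes, vocab_size):
    """H(z_i, x_i): per-bin average of the frame histograms over all frames."""
    hists = [frame_histogram(f, b, vocab_size) for f, b in zip(frames, boxes)]
    return [sum(col) / len(hists) for col in zip(*hists)]

frames = [
    [(1, 1, 0), (2, 2, 1), (9, 9, 1)],  # third word lies outside the box
    [(1, 1, 1), (3, 3, 1)],
]
boxes = [(0, 0, 5, 5), (0, 0, 5, 5)]
print(video_histogram(frames, boxes, vocab_size=2))  # [0.5, 1.5]
```

A superpixel mask would only change the membership test inside `frame_histogram`; the averaging step stays the same.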


For an object class $k$ in a selected foreground of a frame, assume a multivariate Gaussian
distribution defined by parameters $\theta_k = \{\mu_k, \Sigma_k\}$ corresponding to this object
class ($k$) in this foreground. For members of that $k$th object class, the distributions
$H(z_i, x_i)$ and $h(z_i^j, x_i^j)$, i.e., the overall (average) foreground for the video and the
foregrounds in each of the constituent frames, are generated from a common model with
parameters $\theta_k$. Let label $l_i \in \{1, \ldots, K\}$ denote the unknown cluster label of
video $z_i$, which we assume to be drawn from a multinomial mixture with mixture
coefficients $\pi$. Then for video $z_i$ with label $l_i$, its foreground histogram $H(z_i, x_i)$
is drawn from the normal distribution with mean and covariance $\mu_{l_i}$ and $\Sigma_{l_i}$, so

$$H(z_i, x_i) \sim \mathcal{N}(\mu_{l_i}, \Sigma_{l_i}).$$

To reduce the number of parameters to be estimated, we assume the covariance $\Sigma_k$ of each
cluster $k$ to be diagonal: $\Sigma_k = \mathrm{diag}(\kappa_{k1}, \ldots, \kappa_{kS})^{-1}$. Finally, each
video is assumed to have its own independent background model (defined by parameters
$\theta_i^B$), which can be left unresolved for the current object-discovery objective. The figure
(Video Processing Model) summarizes the above description of the generative model.


We maximize the likelihood of the model by marginalizing over the labels, which we treat as
hidden variables. That is, our objective is to find parameters
$\theta = \{\theta_1, \ldots, \theta_K, \pi\}$ and foreground regions $x = \{x_1, \ldots, x_N\}$ to maximize:

$$p(z \mid x, \theta)\, p(x) = \prod_{i=1}^{N} p(z_i \mid x_i, \theta)\, p(x_i)$$

which can be expanded to

$$= \prod_{i=1}^{N} \sum_{k=1}^{K} p(z_i, l_i = k \mid x_i, \theta)\, p(x_i)$$

This corresponds to a video extension of results showing a generative model for
unsupervised object discovery in still images; the details appear in the accompanying article
(Chandrashekar, Torresani & Granger, 2012). As in that work, we can maximize the
proposed penalized likelihood via expectation maximization (EM), alternating between
estimating the distribution over the cluster labels $l_i$ for each video $z_i$, and solving for the
foreground models and locations.
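
The alternation can be sketched in Python for the simplified case in which the foreground masks are held fixed, so that each video is already summarized by its histogram $H(z_i, x_i)$. The search over foreground locations and the prior $p(x)$ are deliberately left out, so this is ordinary EM for a diagonal-covariance Gaussian mixture rather than the full method; the function names and the variance floor `min_var` are assumptions of the sketch.

```python
import math

def log_gauss_diag(h, mu, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (x - m) ** 2 / v
                      for x, m, v in zip(h, mu, var))

def e_step(H, pi, mu, var):
    """Posterior over the cluster label l_i of every video histogram."""
    R = []
    for h in H:
        logp = [math.log(pi[k]) + log_gauss_diag(h, mu[k], var[k])
                for k in range(len(pi))]
        top = max(logp)                      # subtract the max for stability
        w = [math.exp(lp - top) for lp in logp]
        total = sum(w)
        R.append([x / total for x in w])
    return R

def m_step(H, R, min_var=1e-3):
    """Re-estimate mixing coefficients, means, and diagonal variances."""
    n, dims, K = len(H), len(H[0]), len(R[0])
    pi, mu, var = [], [], []
    for k in range(K):
        nk = max(sum(r[k] for r in R), 1e-12)  # guard against empty clusters
        mk = [sum(r[k] * h[s] for r, h in zip(R, H)) / nk for s in range(dims)]
        vk = [max(sum(r[k] * (h[s] - mk[s]) ** 2 for r, h in zip(R, H)) / nk,
                  min_var) for s in range(dims)]
        pi.append(nk / n)
        mu.append(mk)
        var.append(vk)
    return pi, mu, var

def fit(H, pi, mu, var, iters=10):
    """Run EM; R holds the responsibilities from the last E-step."""
    for _ in range(iters):
        R = e_step(H, pi, mu, var)
        pi, mu, var = m_step(H, R)
    return pi, mu, var, R
```

In the full method, the M-step would additionally re-localize the foreground mask in every frame, which changes the histograms between iterations.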


If  we  treat  the  frames  of  a  video  as  still  images,  then  the  previous  work  (Chandrashekar  et  al., 
2012)  has  shown  how  the  foreground  can  be  localized  in  them.  We  have  subsequently 
derived  an  extended  method  whereby  not  just  the  appearance  model,  but  also  the  video’s 
optic-flow  constraints,  can  limit  the  search  for  object  foregrounds  in  video  frames.  Unlike  the 
appearance  model  which  is  applied  across  videos  in  the  dataset,  the  motion  model  is  applied 
only  within  each  video,  ensuring  that  videos  with  very  different  motion  signatures  for  the 
same  objects  can  still  contribute  to  the  appearance  model;  i.e.,  the  learned  appearance  model 
generalizes  over  different  motion  signatures. 

Just as in our previous object recognition and localization methods, we can use two
distinct methods for foregrounding: either rectangular bounding boxes, or bottom-up
segmentation via superpixels.


7.  Developmental  learning 

All of the methods described, as well as the cited work from other labs, contain
a dependency on initialization, which can also be thought of as identifying and setting priors.
For unsupervised learning, the initialization typically has a substantive impact on the quality
of the final results. The parameters requiring initialization in our model are: mixture
coefficients ($\pi_k$); histogram means ($\mu_k$) and variances ($\Sigma_k$); and foreground masks
for all video frames ($x_i^j$). Our first versions of these models initialized foreground masks by
matching all pairs of images, performing a co-segmentation, which is an expensive process.
Even this step was used only for stills; it was dispensed with in the case of videos, in favor of
using motion-based segmentation to get initial estimates of foreground masks.

A new method has been developed to use large amounts of unlabeled data to automatically
acquire approximations of priors. The method is generative and generates multiclass
classifications (both as opposed to discriminative methods such as SVMs).

The intuition is that of a "developmental" stage in which the system uses a specialized set of
rules on masses of otherwise uninterpreted data, to generate an initial tree that will
correspond to a large vocabulary of features and collections of features. These will then be
used from then on, in what can be thought of as the subsequent "adult" phase of the system,
for the tasks we have been studying (recognition, classification, localization). The data
structure acquired by the method is a hierarchy that contains information about vocabulary
features and their relations to each other; it is termed a branching object-relation notation
(BORN).

These trees are intuitively related to "vocabulary trees" of the kind described by Nister &
Stewenius (2006), for instance: tree structures that capture a vocabulary of image features
in a hierarchical form obtained by recursive application of K-means. We treat images as
collections of objects, embedded in various settings. The objects are represented in the
BORN structures, which are constructed from very large collections of unlabeled images.

If the task is image retrieval alone, low-level feature-based representations may suffice, but
for tasks of recognition and localization, richer representations will be beneficial. From a
collection of unlabeled data ($U$) we identify object-like regions (via a published set of
methods for images, and via those methods supplemented with optic flow information for
videos). That collection of regions, $O$, is represented using simple bag-of-visual-words
(bovw) modeling, where $h(o)$ is the histogram of features for object region $o \in O$. We
recursively cluster this collection (similar to Nister & Stewenius 2006) by applying a
Gaussian mixture model on the dataset to organize the appearance-based clusters in the
form of a tree.


At each node $t$ of the BORN tree, we have a collection of object regions $O^t$. We assume a
generative framework in which $h(o^t)$ is modeled as a random variable drawn from a
Gaussian distribution with parameters $\theta_l^t = \{\mu_l^t, \Sigma_l^t\}$, i.e., the histogram
is related to a multivariate normal:

$$h(o^t) \sim \mathcal{N}(\mu_l^t, \Sigma_l^t)$$

where $l \in \{1, \ldots, K\}$ denotes the (unknown) cluster label of region $o$ at node $t$ in the
tree. The label $l$ is assumed to be drawn from a multinomial distribution with mixture
parameters $\pi = \{\pi_1, \ldots, \pi_K\}$.

We again use the EM algorithm to maximize the likelihood of the model by marginalizing over
the cluster labels; the likelihood function is:

$$\prod_{o \in O^t} \sum_{l=1}^{K} \pi_l \, \mathcal{N}\!\left(h(o);\, \mu_l^t, \Sigma_l^t\right)$$


For each cluster, a new child node is created in the BORN tree under node $t$. The cluster
means and variances ($\theta_l^t$) for the child nodes are recorded, and each cluster is further
subdivided using the same process, continuing until a maximum allowed tree depth is reached.
The algorithm for producing the BORN tree is stated below:

Input:  U, a set of unlabeled images
Output: BORN, rooted at r

Z = ∅
for image i ∈ U do
    O_i = Object_Region_Detector(i)
    Z ← Z ∪ O_i
end
Init: Z^r = Z
Q = {r}
while Q ≠ ∅ do
    t ← a node in Q
    Q ← Q − t
    if SUBDIVIDE(Z^t) = true then
        [θ_{1..K}, C_{1..K}] = GMM(Z^t, K)
        foreach Cluster C_k do
            Node c
            Z^c = C_k
            B^t_k = θ_k
            Q ← Q ∪ c
        end
    end
end

Figure 3: Branching object-recognition tree

Once  a  BORN  structure  has  been  built  from  a  large  unlabeled  dataset,  we  can  use  it  to 
perform  image  retrieval.  Every  image  in  the  retrieval  dataset  is  encoded  via  the  BORN  tree. 
The  similarity  between  a  query  image  and  an  image  in  the  database  is  determined  by 
comparing  the  paths  that  are  taken  through  the  tree,  by  object  regions  from  the  images. 


Each object region in an image is first described as a histogram (using bag-of-visual-words
notation). The similarity between two images $q$ and $d$ can then be computed from the
weighted vectors

$$q_i = n_i w_i, \qquad d_i = m_i w_i, \qquad w_i = \ln \frac{N}{N_i}$$

where $w_i$ is the weight of node $i$ in the tree. The variables $n_i$ and $m_i$ are the numbers
of descriptor vectors of the query and the database image, respectively, with a path through
node $i$ in the tree. $N$ is the total number of images in the database, and $N_i$ is the number of
images that have at least one object region passing through node $i$.
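
A small Python sketch of this weighting follows. The node weights and the weighted vectors follow the definitions above; the final comparison (an L1 distance between L1-normalized vectors, as commonly used with vocabulary trees) is an assumption of this example, since the report does not spell out the scoring formula. The representation of an image as a list of node-id paths and all function names are hypothetical.

```python
import math
from collections import Counter

def node_counts(image_paths):
    """n_i (or m_i): number of object regions whose path passes through node i."""
    return Counter(node for path in image_paths for node in set(path))

def node_weights(db):
    """w_i = ln(N / N_i), with N_i the number of database images touching node i."""
    N = len(db)
    images_per_node = Counter(node for img in db for node in node_counts(img))
    return {i: math.log(N / Ni) for i, Ni in images_per_node.items()}

def weighted_vector(image_paths, w):
    """q_i = n_i * w_i (nodes never seen in the database are ignored)."""
    return {i: n * w[i] for i, n in node_counts(image_paths).items() if i in w}

def score(q, d):
    """Assumed scoring: L1 distance of L1-normalized vectors (0 = identical)."""
    def norm(v):
        s = sum(abs(x) for x in v.values())
        return {i: x / s for i, x in v.items()} if s else {}
    q, d = norm(q), norm(d)
    return sum(abs(q.get(i, 0.0) - d.get(i, 0.0)) for i in set(q) | set(d))

# Three database images, each with one object region (one root-to-leaf path).
db = [[[0, 1, 3]], [[0, 1, 4]], [[0, 2, 5]]]
w = node_weights(db)
query = weighted_vector([[0, 1, 3]], w)
ranked = sorted(range(len(db)), key=lambda j: score(query, weighted_vector(db[j], w)))
print(ranked)  # database images ordered from most to least similar
```

The root node is visited by every image and so receives weight ln(1) = 0, which means only the more selective, deeper nodes influence the ranking.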


Once a BORN tree has been constructed, subsequent learning, which corresponds to
normal learning approaches such as learning label trees, can be thought of as "adult"
learning, using the results of the developmental creation of the BORN tree as setting
priors for the adult stage (and thus substantially improving both learning rate and
accuracy). Our method for object detection and classification, described earlier (and in
Chandrashekar and Granger 2012), can be modified to use the BORN representation,
constructing a label tree as a subgraph of the BORN tree.

Classifiers that learn labeled trees have been shown to be more efficient than typical
approaches that learn 1-versus-rest classifiers, as well as reducing recognition time,
often to the order of the log of the size of the learned labeled tree (Deng et al., 2011; Bengio
et al., 2010). All such methods, however, perform splits of the data by supervised means,
learning hyperplanes. In contrast, we introduced methods using unsupervised clustering
and localization via maximization of a single likelihood objective (Chandrashekar &
Granger 2012; Chandrashekar et al., 2012). In particular, we constructed labeled trees via
purely unsupervised splits of the data, iteratively "purifying" the clusters according to
whether the supervised labels, within the unsupervised clusters, were consistent.
Combining the two methods above can be seen as providing a natural framework for
performing joint recognition and localization to construct hierarchical representations;
moreover, the method is generative and thus can be used for multiclass classification.


The constructed labeled tree, as a subset of the BORN tree, contains only nodes where
$N^t > 0$, i.e., all nodes in the BORN tree through which labeled training data has traversed.
Thus the primary task for adult learning is to divide the labeled data at each node $t$
into K clusters. This is accomplished by performing a generative clustering and
foreground localization task by maximizing the objective function specified in
Chandrashekar, Torresani & Granger (2012). The clusters created are treated as child
nodes of BORN tree nodes, with Gaussian foreground parameters stored at the branch. We
then examine each cluster to see if it needs to be further split, via an algorithm very
similar to that described earlier for the CSL algorithm:

Input:  BORN; Dataset: Z ← {<z_i, y_i>} with labels Y = {y_i ∈ {1, 2, 3, ..., K}}
Output: Label Tree LT

Init: t ← root node of BORN
      Z^t ← Z
      Q ← t
while Q ≠ ∅ do
    t ← node in Q
    Q ← Q − t
    Y^t ← Y(Z^t)
    N^t ← |Z^t|
    if |Y^t| > 1 then
        Init: x^t and θ^t using node t ∈ BORN
        Compute θ^t, x^t by maximizing L
        for i = 1 : |B^t| do
            c ← B^t_i
            if C_i ≠ ∅ then
                Z^c = C_i
                B^t_i = θ_i
                Q ← Q ∪ c
            end
        end
    end
end

Figure 4: Constructed labeled tree algorithm


The  process  of  building  BORN  representations  developmentally,  and  using  them  for 
embedding  labeled  trees  in  subsequent  adult  learning,  is  illustrated  here: 


[Figure 5: BORN representation. Diagram labels: large unlabeled image collection; object
region bovw; scalable object region tree; labeled image dataset; label tree.]


In sum, this has been an initial proof-of-principle effort to investigate new approaches to
processing images, with an emphasis on moving images. Little prior work has been done on
the processing of purely unlabeled images, let alone unlabeled video images, despite the fact
that the task of labeling data is among the most expensive (human-intensive) components of
image processing. We studied the development of novel methods for visual processing of
images and moving scenes. The work has substantially exceeded the initial proof-of-principle
aims: we have developed novel algorithms for object recognition and localization
in both still images and videos, and have carried out initial evaluations comparing the
new methods with previous approaches in the literature. The results have been highly
promising and already have led to two publications (as well as a review paper). Ongoing
work is aimed at extending the initial findings to include a suite of advanced approaches to
the task of processing time-varying images.

Although brain circuits presumably carry out powerful perceptual algorithms, few instances
of derived biological methods have been found to compete favorably against algorithms
that have been engineered for specific applications. We forward a novel analysis of a subset
of functions of cortical-subcortical loops, which constitute more than 80% of the human
brain, thus likely underlying a broad range of cognitive functions. We describe a family of
operations performed by the derived method, including a non-standard method for
supervised classification, which may underlie some forms of cortically dependent associative
learning. The novel supervised classifier is compared against widely used algorithms for
classification, including support vector machines (SVM) and k-nearest neighbor methods,
achieving corresponding classification rates at a fraction of the time and space costs.
This represents an instance of a biologically derived algorithm comparing favorably against
widely used machine learning methods on well-studied tasks.
1. INTRODUCTION

Distinct brain circuit designs exhibit different functions in human
(and other animal) brains. Particularly notable are studies of the
basal ganglia (striatal complex), which have arrived at closely
related hypotheses, from independent laboratories, that the system
carries out a form of reinforcement learning (Sutton and Barto,
1990; Schultz et al., 1997; Schultz, 2002; Daw, 2003; O'Doherty
et al., 2003; Daw and Doya, 2006); despite ongoing differences
in the particulars of these approaches, their overall findings are
surprisingly concordant, corresponding to a still-rare instance of
convergent hypotheses of the computations produced by a particular
brain circuit. Models of thalamocortical circuitry have not yet
converged to functional hypotheses that are as widely agreed-on,
but several different approaches nonetheless hypothesize the ability
of thalamocortical circuits to perform unsupervised learning,
discovering structure in data (Lee and Mumford, 2003; Rodriguez
et al., 2004; Granger, 2006; George and Hawkins, 2009). Yet
thalamocortical and striatal systems do not typically act in isolation;
they are tightly connected in cortico-striatal loops such that virtually
each cortical area interacts with corresponding striatal regions
(Kemp and Powell, 1971; Alexander and DeLong, 1985; McGeorge
and Faull, 1989). The resulting cortico-striatal loops constitute
more than 80% of human brain circuitry (Stephan et al., 1970,
1981; Stephan, 1972), suggesting that their operation provides the
underpinnings of a very broad range of cognitive functions.

We forward a new hypothesis of the interaction between cortical
and striatal circuits, carrying out a hybrid of unsupervised
hierarchical learning and reinforcement, together achieving a
cortico-striatal loop algorithm that performs a number of distinct
operations of computational utility, including supervised
and unsupervised classification, search, object and feature
localization, and hierarchical memory organization. For purposes of
the present paper we focus predominantly on the particular task
of supervised learning.

Traditional supervised learning methods typically identify class
boundaries by focusing primarily on the class labels, whereas
unsupervised methods discover similarity structure occurring within a
dataset; two distinct tasks with separate goals, typically carried out
by distinct algorithmic approaches.

Widely used supervised classifiers such as support vector
machines (Vapnik, 1995), supervised neural networks (Bishop,
1996), and decision trees (Breiman et al., 1984; Buntine, 1992), are
so-called discriminative models, which learn separators between
categories of sample data without learning the data itself, and
without illuminating the similarity structure within the data set being
classified.

The cortico-striatal loop (CSL) algorithm presented here is
"generative," i.e., it is in the category of algorithms that models data
occurring within each presented class, rather than seeking solely
to identify differences between the classes (as would a
"discriminative" method). Generative models are often taken as performing
excessive work in cases where the only point is to distinguish
among labeled classes (Ng and Jordan, 2002). The CSL method
may thus be taken as carrying out more tasks than classification,
which we indeed will see it does. Nonetheless, we observe the
behavior of the algorithm in the task of classification, and compare
it against discriminative classifiers such as support vectors,
and find that even in this restricted (though very widely used)
domain of application, the CSL method achieves classification
accuracy comparable to that of discriminative models, and uses far less
computational cost to do so, despite carrying out the additional work
entailed in generative learning.

The approach combines the two distinct tasks of unsupervised
classification and reinforcement, producing a novel method for yet
another task: that of supervised learning. The new method identifies
supervised class boundaries as a byproduct of uncovering
structure in the input space that is independent of the supervised
labels. It performs solely unsupervised splits of the data into
similarity-based clusters. The constituents of each subcluster are
checked to see whether or not they all belong to the same intended,
supervised category. If not, the algorithm makes another unsupervised
split of the cluster into subclusters, iteratively deepening
the class tree. The process repeats until all clusters contain only (or
largely) members of a single supervised class. The result is the
construction of a hierarchy of mostly mixed classes, with the leaves of
the tree being "pure" categories, i.e., those whose members contain
only (or mostly) a single shared supervised class label.

Some key characteristics of the method are worth noting.

* Only unsupervised splits are performed, so clusters always
contain only members that are similar to each other.

* In the case of similar-looking data that belong to distinct
supervised categories (e.g., similar-looking terrains, one leading to
danger and one to safety), these data will constitute a difficult
discrimination; i.e., they will reside near the boundary that
partitions the space into supervised classes.

* In cases of similar data with different class labels, i.e., difficult
discriminations, the method will likely perform a succession
of unsupervised splits before happening on one that splits the
dangerous terrains into a separate category from the safe ones.

In other words, the method will expend more effort in cases of
difficult discriminations. (This characteristic is reminiscent of the
mechanism of support vectors, which identify those vectors near
the intended partition boundary, attempting to place the boundary
so as to maximize the distance from those vectors to the
boundary.) Moreover, in contrast to supervised methods that provide
expensive, detailed error feedback at each training step (instructing
the method as to which supervised category the input should
have been placed in), the present method uses feedback that is
comparatively far less expensive, consisting of a single bit at
each training step, telling the method whether or not an unsupervised
cluster is yet "pure"; if so, the method stops for that node; if
not, the method performs further unsupervised splits.

This deceptively simple mechanism not only produces a supervised
classifier, but also uncovers the similarity structure embedded
in the dataset, which competing supervised methods do not.
Despite the fact that competing algorithms (such as SVM and KNN)
were designed expressly to obtain maximum accuracy at supervised
classification, we present findings indicating that even on
this task, the CSL algorithm achieves comparable accuracy, while
requiring significantly less computational resource cost.

In sum, the CSL algorithm, derived from the interaction of
cortico-striatal loops, performs an unorthodox method that rivals
the best standard methods in classification efficacy, yet does so in
a fraction of the time and space required by competing methods.

2. CORTICO-STRIATAL LOOPS

The basal ganglia (striatal complex), present in reptiles as well as
in mammals, is thought to carry out some form of reinforcement
learning, a hypothesis shared across a number of laboratories (Sutton
and Barto, 1990; Schultz et al., 1997; Schultz, 2002; Daw, 2003;
O'Doherty et al., 2003; Daw and Doya, 2006). The actual neural
mechanisms proposed involve action selection through a maximization
of the corresponding reward estimate for the action on
the task (see Brown et al., 1999; Gurney et al., 2001; Daw and Doya,
2006; Leblois et al., 2006; Houk et al., 2007 for a range of views on
action selection). This reward estimation occurs in most models
of the striatum through the regulation of the output of the
neurotransmitter dopamine. Therefore, in computational terms we can
characterize the functionality of the striatum as an abstract search
through the space of possible actions, guided by dopaminergic
feedback.

The neocortex and thalamocortical loops are thought to
hierarchically organize complex fact and event information, a hypothesis
shared by multiple researchers (Lee and Mumford, 2003;
Rodriguez et al., 2004; Granger, 2006; George and Hawkins, 2009).
For instance, in Rodriguez et al. (2004) the anatomically recognized
"core" and "matrix" subcircuits are hypothesized to carry
out forms of unsupervised hierarchical categorization of static
and time-varying signals; and in Lee and Mumford (2003), George
and Hawkins (2009), Riesenhuber and Poggio (1999), Ullman
(2006), and many others, hypotheses are forwarded of how cortical
circuits may construct computational hierarchies; these studies
from different labs propose related hypotheses of thalamocortical
circuits performing hierarchical categorization.

It is widely accepted that these two primary telencephalic structures,
cortex and striatum, do not act in isolation in the brain;
they work in tight coordination with each other (Kemp and Powell,
1971; Alexander and DeLong, 1985; McGeorge and Faull,
1989). The ubiquity of this repeated architecture (Stephan et al.,
1970, 1981; Stephan, 1972) suggests that cortico-striatal circuitry
underlies a very broad range of cognitive functions. In particular,
it is of interest to determine how semantic cortical information
could provide top-down constraints on otherwise too-broad
search during (striatal) reinforcement learning (Granger, 2011).
In the present paper we study this interaction in terms of subsets
of the leading extant computational hypotheses of the two
components: thalamocortical circuits for unsupervised learning and
the basal ganglia/striatal complex for reinforcement of matches
and mismatches. If these bottom-up analyses of cortical and striatal
function are taken seriously, it is of interest to study what
mechanisms may emerge from the interaction of the two mechanisms
when engaged in (anatomically prevalent) cortico-striatal
loops. We adopt straightforward and tractable simplifications of
these models, to study the operations that arise when the two
are interacting. Figure 1 illustrates a hypothesis of the functional
interaction between unsupervised hierarchical clustering
(uhc; cortex) and match-mismatch reinforcement (mmr; striatal
complex), constituting the integrated mechanism proposed here.

The interactions in the simplified algorithm are modeled in
part on mechanisms outlined in Granger (2006): a simplified
model of thalamocortical circuits produces unsupervised clusters
of the input data; then, in the CSL model, the result of the
clustering, along with the corresponding supervised labels, are
examined by a simplified model of the striatal complex. The full
computational models of the thalamocortical hierarchical clustering
and sequencing circuit and striatal reinforcement-learning
circuit yield interactions that are under ongoing study, and will, it
is hoped, lead to further derivation of additional algorithms. For
FIGURE 1 | Simplified hypothesis of the computations of [...] the match-mismatch
mechanism [...] tree by creating subclusters (e.g., central node in the diagram); these in
turn are checked for label consistency, as before. The process iterates until
the leaf nodes of the unsupervised tree contain only category members of
a single class (see text).


the present paper, we use just small subsets of the hypothesized
functions of these structures: solely the hypothesized hierarchical
clustering function of the thalamocortical circuit, and a very-reduced
subset of the reinforcement-learning capabilities of the
striatal complex, such that it does nothing more than compare
(match/mismatch) the contents of a proposed category, and
return a single bit corresponding to whether the contents all have
been labeled as "matching" each other (1) or not (0). This very-reduced
RL mechanism can be thought of simply as rewarding
or punishing a category based on its constituents. In particular,
the proposed simplified striatal mechanism returns a single
bit (correct/incorrect) denoting whether the members of a given
unsupervised cluster all correspond to the same supervised "label."
If not, the system returns a "no" to the unsupervised clustering
mechanism, which in turn iterates over the cluster, producing
another, still unsupervised, set of subclusters of the "impure"
cluster. The process continues until each unsupervised subcluster
contains members only (or mostly, in a variant of the algorithm)
of a single category label.
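The single-bit match/mismatch signal described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the function name `is_pure` and the `threshold` parameter are hypothetical, with `threshold=1.0` standing for the strict "all members share one label" test and lower values for the "mostly pure" variant.

```python
from collections import Counter

def is_pure(labels, threshold=1.0):
    """Return the one-bit feedback for a single unsupervised cluster.

    True  -> the cluster is "pure": stop splitting this node.
    False -> the cluster is "impure": request another unsupervised split.
    `threshold` is a hypothetical knob; 1.0 demands a single shared label.
    """
    if not labels:
        return True  # an empty cluster has nothing left to split
    # Size of the largest same-label group within the cluster
    majority_count = Counter(labels).most_common(1)[0][1]
    return majority_count / len(labels) >= threshold

# Similar-looking mushrooms with mixed labels form an impure cluster
print(is_pure(["edible", "poisonous", "edible"]))
```

Note that the feedback never reveals which label was expected; it only reports whether the cluster needs further splitting.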

In sum, the mechanism uses only unsupervised categorization
operations, together with category membership tests. These two
mechanisms result in the eventual iterative arrival at categories
whose members can be considered in terms of supervised classes.

Since only unsupervised splits are performed, categories (clusters)
always contain only members that are similar to each other.
The tree may generate multiple terminal leaves corresponding to
a given class label; in such cases, the distinct leaves correspond
to dissimilar class subcategories, eventually partitioned into distinct
leaf nodes. The mechanism can halt rapidly if all supervised
classes correspond to similarity-based clusters; i.e., if class labels
are readily predictable from their appearance. This corresponds
to an "easy" discrimination task. When this is not the case, i.e., in
instances where similar-looking data belong to different labeled
categories (e.g., similar mushrooms, some edible and some
poisonous), the mechanism will be triggered to successively subdivide
clusters into subclusters, as though searching for the characteristics
that effectively separate the members of different labels.

In other words, less work is done for "easy" discriminations;
and only when there are difficult discriminations will the mechanism
perform additional steps. The tree becomes intrinsically
unbalanced as a function of the lumpiness of the data: branches
of the tree are only deepened in regions of the space where the
discriminations are difficult, i.e., where members of two or more
distinct supervised categories are close to each other in the input
space. This property is reminiscent of support vectors, which identify
boundaries in the region where two categories are closest (and
thus where the most difficult discriminations occur).

A final salient feature of the mechanism is its cost. In contrast
to supervised methods, which provide detailed, expensive
error feedback at each training step (telling the system not only
when a misclassification has been made but also exactly which class
should have occurred), the present method uses feedback that by
comparison is extremely inexpensive, consisting of a single bit,
corresponding to either "pure" or "impure" clusters. For pure clusters,
the method halts; for impure clusters, the mechanism proceeds to
deepen the hierarchical tree.

As mentioned, the method is generative, and arrives at rich
models of the learned input data. It also produces multiclass
partitioning as a natural consequence of its operation, unlike
discriminative supervised methods which are inherently binary,
requiring extra mechanisms to operate on multiple classes.

Overall, this deceptively simple mechanism not only produces
a supervised classifier, but also uncovers the similarity structure
embedded in the dataset, which competing supervised methods do
not. The terminal leaves of the tree provide final class information,
whereas the internal nodes provide further information: they are
mixed categories corresponding to meta labels (e.g., superordinate
categories); these also can provide information about which classes
are likely to become confused with one another during testing.

In the next section we provide an algorithm that retains functional
equivalence with the biological model for supervised learning
described above while abstracting out the implementation
details of the thalamocortical and striatal circuitry. Simplifying
the implementation enables investigation of the algorithmic properties
of the model independent of its implementation details
(Marr, 1980). It also, importantly, allows us to test our model on
real-world data and compare directly against standard machine
learning methods. Using actual thalamocortical circuitry to perform
the unsupervised data clustering, and the mechanism for
the basal ganglia to provide reinforcement feedback, would be
an interesting task for the distinct goal of investigating potential
implementation-level predictions; this holds substantial potential
for future research.

We emphasize that our focus is to use existing hypotheses of
telencephalic component function already posited in the literature;
these mechanisms lead us to specifically propose a novel method
by which supervised learning is achieved, by the unlikely route of
combining unsupervised learning with reinforcement. This kind
of computational-level abstraction and analysis of biological entities
continues in the tradition of many prior works, including
Suri and Schultz (2001), Schultz (2002), Daw and Doya (2006),
Lee and Mumford (2003), Rodriguez et al. (2004), George and
Hawkins (2009), Marr (1980), Riesenhuber and Poggio (1999),
Ullman (2006), and many others.

3.  SIMPLIFIED  ALGORITHM 

In our simplified algorithm, we refer to a method which we
term PARTITION, corresponding to any of a family of clustering
methods, intended to capture the clustering functionality of
thalamocortical loops as described in the previous sections; and
we refer to a method we term SUBDIVIDE, corresponding to
any of a family of simple reinforcement methods, intended to
capture the reinforcement-learning functionality of the basal
ganglia/striatal complex as described in the previous sections. These
operate together in an iterative loop corresponding to cortico-striatal
(cluster-reinforcement) interaction: SUBDIVIDE checks
for the "terminating" conditions of the iterative loop by examining
the labels of the constituents of a given cluster and returning
a true or false response. The resulting training method builds a
tree of categories which, as will be seen, has the effect of performing
supervised learning of the classes. The leaves of the tree
contain class labels; the intermediate nodes may contain members
of classes with different labels. During testing, the tree is traversed
to obtain the label prediction for the new samples. Each data sample
(belonging to one of K labeled classes) is represented as a vector
x ∈ ℝ^m. During training, each such vector x_i has a corresponding
label y_i ∈ {1, ..., K}. (The subsequent "Experiments" section below
describes the methods used to transform raw data such as natural
images into vector representations in a domain-dependent
fashion.)

3.1. TRAINING

The input to the training procedure is the training dataset consisting
of (x_i, y_i) pairs, where x_i is an input vector and y_i is its intended
class label, as in all supervised learning methods. The output is a
tree that is built by performing a succession of unsupervised splits
of the data. The data corresponding to any given node in the tree
is a subset of the original training dataset, with the full dataset
corresponding to the root of the tree. The action performed with the
data at a node in the tree is an unsupervised split, thereby generating
similarity-based clusters (subclusters) of the data within that
tree node. The unsupervised split results in expansion (deepening)
of the tree at that node, with the child nodes corresponding to the
newly created unsupervised data clusters. The cluster representations
corresponding to the children are recorded in the current
node. These representations are used to determine the local branch
that will be taken from this node during testing, in order to obtain
a class prediction on a new sample. For each of the new children
nodes, the labels of the samples within the cluster are examined,
and if they are deemed to be sufficiently pure, i.e., a sufficient
percentage of the data belong to the same class, then the child node
becomes a (terminal) leaf in the tree. If not, the node is added
to a queue which will be subjected to further processing, growing
the tree. This queue is initialized with the root of the tree. The
procedure (sketched in Algorithm 1 below) proceeds until the
queue becomes empty.
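The queue-driven training procedure can be illustrated with a minimal Python sketch. Several choices here are assumptions made for illustration rather than details from the text: plain k-means with deterministic seeding stands in for PARTITION, a majority-fraction test stands in for SUBDIVIDE, nodes are plain dictionaries, and the names `partition`, `should_subdivide`, `train`, `k_max`, and `purity` are hypothetical. The guard against degenerate splits is also an addition, so that identical points with conflicting labels cannot loop forever.

```python
from collections import Counter, deque

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def mean(points):
    return [sum(col) / len(points) for col in zip(*points)]

def partition(points, k, iters=20):
    """Stand-in for PARTITION: k-means seeded with evenly spaced samples."""
    step = max(1, len(points) // k)
    centroids = [list(p) for p in points[::step][:k]]

    def nearest(p):
        return min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, [nearest(p) for p in points]

def should_subdivide(ys, purity=1.0):
    """Stand-in for SUBDIVIDE: True when the cluster is not pure enough."""
    if not ys:
        return False
    return Counter(ys).most_common(1)[0][1] / len(ys) < purity

def make_node(xs, ys):
    return {"X": xs, "Y": ys, "centroids": [], "children": []}

def train(xs, ys, k_max=2, purity=1.0):
    root = make_node(xs, ys)
    queue = deque([root])              # the queue is initialized with the root
    while queue:
        node = queue.popleft()
        if not should_subdivide(node["Y"], purity):
            continue                   # pure enough: the node stays a leaf
        centroids, assign = partition(node["X"], min(k_max, len(node["X"])))
        groups = sorted(set(assign))
        if len(groups) < 2:
            continue                   # degenerate split: stop at this node
        for c in groups:
            idx = [i for i, a in enumerate(assign) if a == c]
            child = make_node([node["X"][i] for i in idx],
                              [node["Y"][i] for i in idx])
            node["centroids"].append(centroids[c])  # recorded in the parent
            node["children"].append(child)
            queue.append(child)
    return root

tree = train([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]],
             ["a", "a", "b", "b"])
print([child["Y"] for child in tree["children"]])
```

In this sketch every child is queued and the purity test is applied when the node is dequeued, which yields the same tree as testing purity before enqueuing.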

To summarize the mechanism, the algorithm attempts to find
clusters based on appearance similarity, and when these clusters
don't match with the intended (supervised) categories,
reinforcement simply gives the algorithm the binary command to either
split or not split the errant cluster. The behavior of the algorithm
on sample data is illustrated in Figure 2. The input space of images
is partitioned by successively splitting the corresponding training
samples into subclusters at each step.

3.1.1. Picking the right branch factor

Since the main free parameter in the algorithm is the number
of unsupervised clusters to be spawned from any given node in
the hierarchy, the impact of that parameter on the performance
of the algorithm should be studied. This quantity corresponds
to the branching factor for the class tree. We initially propose a
single parameter as an upper bound for the branch factor, K_max,
which fixes the largest number of branches that can be spawned
from any node in the tree. Through experimentation (discussed
in the Results section) we have determined that (i) very small
values for this parameter result in slightly lower prediction accuracy;


Input: Dataset X = {x_i; x_i ∈ ℝ^m} with labels
Y = {y_i; y_i ∈ {1, 2, 3, ..., K}}

Output: Class tree; a tree rooted at the node TRoot

Init: TRoot.X = X; TRoot.Y = Y; TRoot.Labels = LABELSET(Y)

Q = []; Add(Q, TRoot)
while Q is not empty do
    qn ← First in Q
    if SUBDIVIDE(qn) = true then
        [Centroids, Clusters] = PARTITION(qn.X, K_max)
        foreach cluster k do
            Node T
            T.X = Clusters[k]
            T.Labels = LABELSET(Y(T.X))
            qn.Centroids[k] = Centroids[k]
            qn.Children[k] = T
            Add(Q, T)
        end
    end
end

Algorithm 1 | A sketch of the CSL learning algorithm. The method
constructs a class tree, which records unsupervised structure within
the data [...]
FIGURE 2 | A simplified illustration of the iterative learning process with
labeled data. Images are successively split into two partitions in an
unsupervised fashion (i.e., by similarity). The partitioning of data proceeds
iteratively until the clusters formed are pure with respect to their labels. For
each split, on either side of the dividing hyperplane, the means are shown as
an overlay of the images that fall on the corresponding side of the hyperplane.


(ii) for sufficiently large values, the parameter setting has no
significant impact on the performance efficacy of the classifier; and
(iii) larger values of the parameter modestly increase the memory
requirements of the clustering algorithm and thus the runtime of
the learning stage (see the Results section below for further detail).
(It is worth noting that selection of the best branch factor value
may be obtained by examination of the distribution of the data
to be partitioned in the input space, enabling automatic selection
of the ideal number of unsupervised clusters without reference
to the number of distinct labeled classes that occur in the space.
Future work may entail the study of existing methods for this
approach (Barron and Cover, 1991; Teh et al., 2001) as potential
adjunct improvements to the CSL method.)

3.2. TREE PRUNING

Categorization algorithms are often subject to overfitting the data.
Aspects of the CSL algorithm can be formally compared to those
of decision trees, which are subject to overfitting.

Unlike decision trees, the classes represented at the leaves of the
CSL tree need not be regarded as conjunctions of attribute values
on the path from the root, and can be treated as fully represented
classes by themselves. (We refer to this as the "leaf independence"
property of the tree; this property will be used when we describe
testing of the algorithm in the next section.) Also, since the splits
are unsupervised and based on multidimensional similarity (also
unlike decision trees), they exhibit robustness w.r.t. variances in
small subsets of features within a class.

Both of these characteristics (leaf independence and unsupervised
splitting) theoretically lead to predictions of less overfitting
of the method.

In addition to these formal observations, we studied overfitting
in the CSL method empirically. Analogously to decision trees,
we could choose either to stop growing the tree before all leaves
were perfectly pure (and potentially overfit), or to build a full
tree and then somewhat prune it back. Both methods reduce
the overfitting problem observed in decision trees. Experiments
with both methods in the CSL algorithm found that neither one
had a significant effect on prediction accuracy. Thus, surprisingly,
both theoretical and empirical studies find that the CSL class trees
generalize well without overfitting; the method is unexpectedly
resistant to overfitting.

3.3.  TESTING 

During testing, the algorithm is presented with previously unseen
data samples whose class we wish to predict. The training
phase created an appearance-based class hierarchy. Since the
tree, including the "pure class" leaves, is generative in nature,
there are two alternative procedures for class prediction. One
is that of descending the tree, as is done in decision trees.
However, in addition, the "leaf independence" property of the
CSL tree, as described in the previous section, enables another
testing method, which we refer to as KNN-on-leaves, in which
we only attend to the leaf nodes of the tree, as described in the
second subsection below. (This property does not hold for decision
trees, and thus this additional testing method cannot be applied to
decision trees.) The two test methods have somewhat different
memory and computation costs and slightly different prediction
accuracies.

3.3.1. Tree descent

This approach starts at the root of the class tree, and descends. At
every node, the test datum is compared to the cluster centroids
stored at the node to determine the branch to take. The branch
taken corresponds to the closest centroid to the test datum; i.e.,
a decision is made locally at the node. This provides us a unique
path from the root of the class hierarchy to a single leaf; the stored
category label at that leaf is used to predict the label of the input.
Due to tree pruning (described above), the leaves may not be
completely pure. As a result, instead of relying on any given class being
present in the leaves, the posterior probabilities for all the
categories represented at the leaf are used to predict the class label for
the sample.
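As a rough illustration of the descent just described, the following Python sketch walks a toy class tree and reads the label posteriors at the leaf it reaches. The `Node` layout, the field names, and the particular `similarity` function are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Node:
    """Hypothetical class-tree node: one stored centroid per child branch."""
    centroids: List[Sequence[float]] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    label_counts: Dict[str, int] = field(default_factory=dict)  # filled at leaves


def similarity(x, c):
    """Assumed similarity: decreases monotonically with Euclidean distance."""
    return 1.0 / (1.0 + math.dist(x, c))


def tree_descent(x, root):
    """Descend by taking the most similar centroid at each node (a local
    decision), then predict from the label posteriors stored at the leaf."""
    node = root
    while node.children:
        sims = [similarity(x, c) for c in node.centroids]
        node = node.children[sims.index(max(sims))]
    total = sum(node.label_counts.values())
    posteriors = {lab: n / total for lab, n in node.label_counts.items()}
    return max(posteriors, key=posteriors.get), posteriors


# Toy tree with one impure leaf (illustrative values only).
leaf_a = Node(label_counts={"cat": 9, "dog": 1})
leaf_b = Node(label_counts={"car": 5})
root = Node(centroids=[(0.0, 0.0), (10.0, 10.0)], children=[leaf_a, leaf_b])
label, post = tree_descent((1.0, 2.0), root)
```

Only the centroids along one root-to-leaf path are compared, which is why recognition cost depends on tree depth rather than on the total number of nodes.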

3.3.2. KNN-on-leaves

In this approach, we make a note of all the leaves in the tree,
along with the cluster representation in the parent of the leaf node
corresponding to the branch which leads to the leaf. We then do
K-nearest neighbor matching of the test sample with all these cluster
centroids that correspond to the leaves. The final label predicted
corresponds to the label of the leaf with the closest centroid. This
approach implies that only the leaves of the tree need to be stored,
resulting in a significant reduction in the memory required to
store the learned model. However, a penalty is paid in recognition
time, which in this case is proportional to the number of leaves in
the tree.
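A minimal sketch of the KNN-on-leaves idea, with K = 1, might look as follows. The flat `leaves` list of (centroid, label) pairs is an assumed storage format chosen for illustration, and the values are made up.

```python
import math

# Each stored leaf: (centroid of the parent branch leading to it, leaf label).
# Illustrative values only.
leaves = [((0.0, 0.0), "cat"), ((10.0, 10.0), "car"), ((0.0, 10.0), "dog")]


def knn_on_leaves(x, leaves):
    """1-nearest-neighbor match of x against every leaf centroid.
    Cost grows linearly with the number of leaves."""
    _, label = min(leaves, key=lambda leaf: math.dist(x, leaf[0]))
    return label
```

Because internal nodes are never consulted, only this flat list has to be kept after training.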

The memory required to store the model in the tree descent
approach is higher than that for the KNN-on-leaves approach.
However, tree descent offers a substantial speedup in recognition,
as comparisons need to be performed only along a single path
through the tree from the root to the final leaf. The algorithm is
sketched below in Algorithm 2.

We expect that the KNN-on-leaves variant will yield better
prediction accuracy, as the decision is made at the end of the tree and


Input: $x \in \mathbb{R}^M$, class tree: $T_{root}$
Output: $y \in \{1, 2, \ldots\}$
init: Node T = T_root
while T is not a leaf do
    maxSim = 0
    for k = 1 : |T.Children| do
        sim = SIMILARITY(x, T.Centroids[k])
        if sim > maxSim then
            maxSim = sim
            branch = k
        end
    end
    T = T.Children[branch]
end
y = T.LabelSet

Algorithm 2 | A sketch of the tree descent algorithm for classifying a new
data sample. The method starts at the root node and descends, testing the
sample against each node encountered to determine the branch to
select for further descent. The result is a unique path from the root to a
single leaf; the stored category at that leaf is the prediction of the label of
the input. In the event of impure leaves, the posterior probabilities of all
categories in the leaf are used to predict the class label of the sample. See
text for further description.
hence the partitioning of the input space is expected to exhibit
better generalization. In the case of tree descent, since decisions
are made locally within the tree, if the dataset has high variance,
then it is possible that a wrong branch will be taken early on in the
tree, leading to inaccurate prediction. This problem is common to
a large family of algorithms, including decision trees. We have
performed experiments to compare the two test methods; the results
confirm that the KNN-on-leaves method exhibits marginally
better prediction than the tree-descent method. The behavior of the
two methods is illustrated in Figure 4.

4.  CLUSTERING  METHODS 

The only remaining design choice is which unsupervised
clustering algorithm to employ for successively partitioning the data
during training, and the corresponding similarity measure. The
choice can change depending on the type of data to be classified,
while the overall framework remains the same, yielding a
potential family of closely related variants of the CSL algorithm. This
enables flexibility in selecting a particular unsupervised clustering
algorithm for a given domain and dataset, without modifying
anything else in the algorithm. (Using different clustering algorithms
within the same class tree is also feasible, as all decisions are made
locally in the tree.)


There are numerous clustering algorithms, from the simple and
efficient k-means (Lloyd, 1982), self organizing maps (SOM; Kaski,
1997) and competitive networks (Kosko, 1991), to the more
elaborate and expensive probabilistic generative algorithms like mixture
of Gaussians, probabilistic latent semantic analysis (PLSA;
Hoffman, 1999) and latent Dirichlet allocation (LDA; Blei et al., 2003);
each has merits and costs. Given the biological derivation of the
system, we began by choosing k-means, a simple and
inexpensive clustering method that has been discussed previously as a
candidate system for biological clustering (Darken and Moody,
1990); the method could instead use SOM or competitive
learning, two highly related systems. (It remains quite possible that more
robust (and expensive) algorithms such as PLSA and LDA could
provide improved prediction accuracy. Improvements might also
arise by treating the data at every node as a mixture of
Gaussians, and estimating the mixture parameters using the expectation
maximization (EM) algorithm.)

4.1.  k-MEANS

k-Means is one of the most popular algorithms to cluster n
vectors based on a distance measure into k partitions, where k < n. It
attempts to find the centers of natural clusters in the data. The
objective that k-means tries to minimize is the total intra-cluster


FIGURE 4 | Two methods by which the CSL algorithm predicts
category membership at test time. (Left) Class prediction via
hierarchical descent. At each step, a new unlabeled sample will fall
on one or the other side of a classification hyperplane. The choice
provides a path through the class tree at each node. At the leaves the
class prediction is read. The numbering gives the order in which
the hyperplanes are probed. (Right) Class prediction on leaves
of the class tree. All leaves are considered simultaneously; the test
sample is compared to each leaf and the class prediction is determined
using KNN.
variance, or, the squared error function:

$$V = \sum_{i=1}^{K} \sum_{x_j \in C_i} \lVert x_j - \mu_i \rVert^2$$

where there are $K$ clusters $C_i$, $i = 1, 2, \ldots, K$, and $\mu_i$ is the centroid
or mean point of all the points $x_j \in C_i$.

When k-means is used for the unsupervised appearance-based
clustering at the nodes of the class tree, the actual means obtained
are stored at each node, and the similarity measure is inversely
proportional to the Euclidean distance.
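To make the objective and the similarity measure concrete, here is a small Python sketch that evaluates the squared error for a given partition and a distance-based similarity. The function names and the `eps` guard against division by zero are assumptions for illustration; the text only states that similarity is inversely proportional to Euclidean distance.

```python
import math


def squared_error(clusters):
    """Total intra-cluster squared error: for each cluster, sum the squared
    Euclidean distances of its points to the cluster mean."""
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum(math.dist(p, mean) ** 2 for p in points)
    return total


def similarity(x, centroid, eps=1e-12):
    """Similarity inversely proportional to Euclidean distance.
    eps is a hypothetical guard for the zero-distance case."""
    return 1.0 / (math.dist(x, centroid) + eps)


# Two toy clusters with means (1, 0) and (5, 6).
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(5.0, 5.0), (5.0, 7.0)]]
error = squared_error(clusters)
```

In this toy partition every point lies at distance 1 from its cluster mean, so the four squared distances sum to 4.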

4.1.1. Initializing clusters

In general, unsupervised methods are sensitive to initialization.
We initialize the clustering algorithm at every node in the class
tree as follows.

If we are at node $j$, with samples having one of $K_j$ labels, we first
determine the class averages of the $K_j$ categories. (For every class,
we remove the samples which are at least 2 standard deviations
away from the mean of the class for the initialization. These
samples are considered for the subsequent unsupervised clustering.)
If the number of clusters (branches), $K^* = \min(K_j, K_{max})$, turns
out to be equal to $K_j$, then the averages are used as the seeds for
the clustering algorithm. If however $K^* < K_j$, then we use a simple
and efficient method for obtaining the initial clusters by using an
initial run of k-means on the $K_j$ averages in order to obtain the
$K^*$ initial centroids. The data samples are assigned to the
clusters using nearest neighbor mapping, and the averages of these $K^*$
clusters are used as seeds for a subsequent run of the
unsupervised clustering algorithm. (In our empirical experiments we have
used the k-means++ variant of the popular clustering algorithm
to obtain the initial cluster seeds; Arthur and Vassilvitskii, 2007.)
Figure 3 illustrates the initialization method. (While the method
works relatively well, further studies indicate that other
methods, which directly utilize the semantic structure of the labeled
dataset, can result in even better performance. These alternate
approaches are not discussed in this paper in order to keep the
focus on introducing the core algorithm.) It is worth noting that
the initialization method can be thought of in terms of a logically
prior "developmental" period, in which no data is actually stored,
but instead sampling of the environment is used to set parameters
of the method; those parameters, once fixed, are then used in the
subsequent performance of the then-"adult" algorithm (Felch and
Granger, 2008).
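The seeding procedure can be sketched in Python as below. Several details are assumptions made only so the example runs: the "2 standard deviations" test is interpreted as distance to the class mean relative to the root-mean-square distance, the helper `kmeans` is a plain Lloyd iteration seeded with the first k points (the text reports using k-means++ for seeding instead), and all function names are hypothetical.

```python
import math


def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def trimmed_class_mean(points, n_std=2.0):
    """Class average after dropping samples at least n_std deviations from
    the class mean. 'Deviation' = RMS distance to the mean (an assumption)."""
    m = mean(points)
    dists = [math.dist(p, m) for p in points]
    std = math.sqrt(sum(d * d for d in dists) / len(points))
    kept = [p for p, d in zip(points, dists) if std == 0 or d < n_std * std]
    return mean(kept)


def kmeans(points, k, iters=20):
    """Plain Lloyd iterations seeded with the first k points (illustrative)."""
    centers = list(points[:k])
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda i: math.dist(p, centers[i]))
            groups[j].append(p)
        centers = [mean(g) if g else centers[i] for i, g in enumerate(groups)]
    return centers


def initial_seeds(samples_by_class, k_max):
    """Return K* = min(number of classes at the node, k_max) seed centroids."""
    averages = [trimmed_class_mean(pts) for pts in samples_by_class.values()]
    k_star = min(len(averages), k_max)
    if k_star == len(averages):
        return averages                       # one seed per class average
    coarse = kmeans(averages, k_star)         # cluster the class averages
    groups = [[] for _ in range(k_star)]
    for pts in samples_by_class.values():     # nearest-neighbor assignment of
        for p in pts:                         # all samples, outliers included
            j = min(range(k_star), key=lambda i: math.dist(p, coarse[i]))
            groups[j].append(p)
    return [mean(g) if g else coarse[i] for i, g in enumerate(groups)]


data = {"a": [(0.0, 0.0), (0.0, 1.0)],
        "b": [(1.0, 0.0), (1.0, 1.0)],
        "c": [(10.0, 10.0), (10.0, 11.0)]}
seeds_full = initial_seeds(data, k_max=3)     # K* equals the class count
seeds_merged = initial_seeds(data, k_max=2)   # K* smaller than the class count
```

With `k_max=2` the two nearby classes "a" and "b" share one seed, which mirrors the case $K^* < K_j$ in the text.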

5.  EXPERIMENTS

The proposed algorithm performs a number of operations on
its input, including the unsupervised discovery of structure in
the data. However, since the method, despite being composed
only of unsupervised clustering and reinforcement learning, can
nonetheless perform supervised learning, we have run tests that
involve using the CSL method solely as a supervised classifier.
In addition to these tests of supervised learning alone, we then
briefly describe some additional findings illustrating the CSL
algorithm's power at tasks beyond the classification task (including the
tasks of identifying structure in data, and localizing objects within
images).

When viewed solely as a supervised classifier, the CSL method
bears resemblances to two well-studied methods in machine
learning and statistics, and we rigorously compare these. We compared
the accuracy, and the time and space costs, of the CSL
algorithm as a supervised classifier, against the support vector machine
(SVM) and k-nearest neighbor (KNN) algorithms. Performance
was examined on two well-studied public datasets.

For SVM, we have used the popular LibSVM implementation
that is publicly available (Chang and Lin, 2001). This package
implements the "one vs. one" flavor of multiclass classification,
rather than the "one vs. rest" variant, based on the findings reported
in Hsu and Lin (2002). After experimenting with a few kernels,
we chose the linear kernel since it was the most efficient and
especially since it provided the best SVM results for the
high-dimensional datasets we tested. It is known that for the linear
kernel a weight vector can be computed and hence the support
vectors need not be kept in memory, resulting in low memory
requirements and fast recognition time. However, this is not true
for non-linear kernels, where support vectors need to be kept
in memory to get the class predictions at run time. Since we
wish to compare the classifiers in the general setting, and it is
likely that the kernel trick may need to be employed to separate
a non-linear input space, we have retained the implementation of
LibSVM as it is (where the support vectors are retained in
memory and used during testing to get class predictions). We realize
this may not be the fairest comparison for the current set of
experiments; however, we believe that this setting is more
reflective of the typical use case scenario where the algorithms will be
employed.
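The remark about the linear kernel can be illustrated with a short Python sketch for a single binary classifier. It shows that a decision value computed from all support vectors equals the one computed from a single collapsed weight vector. The function names and numbers are made up for the example, and this is not the LibSVM API; a one-vs-one multiclass model would hold one such weight vector per class pair.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def decide_with_support_vectors(x, svs, coefs, bias):
    """f(x) = sum_i coef_i * <sv_i, x> + bias; every support vector is needed."""
    return sum(c * dot(sv, x) for sv, c in zip(svs, coefs)) + bias


def collapse_to_weights(svs, coefs):
    """w = sum_i coef_i * sv_i. Valid for the linear kernel only."""
    dim = len(svs[0])
    return [sum(c * sv[d] for sv, c in zip(svs, coefs)) for d in range(dim)]


svs = [(1.0, 2.0), (3.0, 0.0), (0.0, 1.0)]   # illustrative support vectors
coefs = [0.5, -1.0, 0.25]                    # coefficient times label, per vector
bias = 0.1
w = collapse_to_weights(svs, coefs)
x = (2.0, -1.0)
f_sv = decide_with_support_vectors(x, svs, coefs, bias)
f_w = dot(w, x) + bias
```

Storing `w` costs one vector regardless of how many support vectors exist, which is the memory saving the comparison in the text deliberately leaves unused.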

For KNN we have hand coded the implementation and set
the parameter K = 1 for maximum efficiency. (For the CSL
algorithm with KNN-on-leaves, we use K = 1 as well.) The test bed is
a machine running Windows XP 64 with 8 GB memory. We have
not used hardware acceleration for any of the algorithms to keep
the comparison fair.

We have used two popular datasets from different domains
with very different characteristics (including dimensionality of
the data) to fully explore the strengths and weaknesses of the
algorithm. One is a subset of the Caltech-256 image set, and the other is
a very high-dimensional dataset of neuroimaging data from fMRI
experiments, that has been widely studied.

For both experiments, we performed multiple runs, differently
splitting the samples from each class into training and testing sets
(roughly equal in number). The results shown indicate the means
and standard deviations of all runs.
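The evaluation protocol described here can be sketched as follows. The callback `fit_and_score`, the seeding scheme, and the exact half-and-half split are assumptions for illustration, since the text says only that the splits were random and roughly equal in number.

```python
import random
import statistics


def split_per_class(samples_by_class, rng):
    """Shuffle each class and split it roughly in half into train and test."""
    train, test = [], []
    for label, items in samples_by_class.items():
        items = list(items)
        rng.shuffle(items)
        half = len(items) // 2
        train += [(x, label) for x in items[:half]]
        test += [(x, label) for x in items[half:]]
    return train, test


def evaluate(samples_by_class, fit_and_score, n_runs, seed=0):
    """Repeat the random split n_runs times; report mean and std of the score.
    fit_and_score(train, test) is a hypothetical user-supplied callback."""
    rng = random.Random(seed)
    scores = [fit_and_score(*split_per_class(samples_by_class, rng))
              for _ in range(n_runs)]
    return statistics.mean(scores), statistics.stdev(scores)


data = {"a": list(range(10)), "b": list(range(10, 20))}
train, test = split_per_class(data, random.Random(1))
```

Splitting within each class keeps the class proportions the same in the training and testing sets for every run.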

5.1.  OBJECT RECOGNITION

Our first experiment tests the algorithm for object recognition
in natural still image datasets. The task is to predict the label
for an image, having learned the various classes of objects in
images through a training phase. We report empirical findings
for prediction accuracy and computational resources required.

5.1.1. Data set

The dataset used consists of a subset of the Caltech-256 dataset
(Griffin et al., 2007) using 39 categories, each with roughly 100
instances. The categories were specifically chosen to exhibit very
high between-category similarity, intentionally selected as a very
challenging task, with high potential confusion among classes. The
categories are:


*  Mammals: bear, chimp, dog, elephant, goat, gorilla, kangaroo,
leopard, raccoon, zebra

*  Winged: duck, goose, hummingbird, ostrich, owl, penguin,
swan, bat, cormorant, butterfly

*  Crawlers (reptiles/insects/arthropods/amphibians): iguana,
cockroach, grasshopper, housefly, praying mantis, scorpion,
snail, spider, toad

*  Inanimate objects: backpack, baseball glove, binoculars,
bulldozer, chandeliers, computer monitor, grand piano, iPod,
laptop, microwave.

We have chosen an extremely simple (and very standard) method
for representing images in order to maintain focus on the
description of the proposed classifier. First a feature vocabulary consisting
of SIFT features (Lowe, 2004) is constructed by running k-means
on a random set of images containing examples from all classes of
interest; each image is then represented as a histogram of these
features. The positions of the features and their geometry are ignored,
simplifying the process and reducing computational costs. Thus
each image is a vector $x \in \mathbb{R}^m$, where m is the size of the acquired
vocabulary. Each dimension of the vector is a count of the number
of times the particular feature occurred in the image. This
representation, known as the "Bag of Words," has been successfully
applied before in several domains including object recognition in
images (Sivic and Zisserman, 2003).
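The histogram step of this representation can be sketched in a few lines of Python. The toy two-dimensional "descriptors" and three-word vocabulary are placeholders chosen for readability; real SIFT descriptors are 128-dimensional and the vocabulary would come from k-means as described above.

```python
import math


def bag_of_words(descriptors, vocabulary):
    """Count, for each vocabulary word, how many descriptors are nearest to it.
    Descriptor positions and geometry play no role."""
    hist = [0] * len(vocabulary)
    for d in descriptors:
        nearest = min(range(len(vocabulary)),
                      key=lambda i: math.dist(d, vocabulary[i]))
        hist[nearest] += 1
    return hist


vocabulary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]   # m = 3 toy visual words
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.0, 6.0)]
x = bag_of_words(descriptors, vocabulary)
```

The resulting vector has one entry per vocabulary word, so its length m is fixed regardless of how many descriptors an image yields.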

We ran a total of 8 trials, corresponding to 8 different random
partitionings of the Caltech-256 data into training and testing sets.
In each trial, we ran the test for each of a range of $K_{max}$ values, to
test this free parameter of the CSL model.

5.1.2. Prediction accuracy

The graph in top left of Figure 5 compares the classifier
prediction accuracy of the proposed algorithm with that of SVMs
on the 39-category subset of Caltech-256 described earlier. As expected,
the simplistic image representation scheme, and the readily
confused category members, render the task extremely difficult. It
will be seen that all classifiers perform at a very modest success
rate with this data, indicating the difficulty of the dataset and
the considerable room for potential improvement in classification
techniques.

The two variants of the CSL algorithm are competitive with
SVM: SVM has an average accuracy of 23.9%; CSL with tree
descent has an average accuracy of 19.4%; and CSL with
KNN-on-leaves has an average prediction accuracy of 21.3%. The KNN
algorithm alone performs relatively poorly, with an average
prediction accuracy of 13.6%. Chance probability of correctly predicting
a class is 1 out of 39 (2.56%).

It can be seen that the branch factor does not have a
significant impact on error rates. This is possibly because the class
tree grows until the leaves are pure, and the resulting internal
structure, though different across choices of $K_{max}$, does not
significantly impact the ultimate classifier performance as the hierarchy
adapts its shape. Different internal structure could significantly
affect the performance of the algorithm on tasks that depended
on the similarity structure of the data, but for the sole task of
supervised classification, the tree's internal nodes have little effect
on prediction accuracy.
FIGURE 5 | Comparison of CSL and other standard classifiers. (Top left)
Prediction accuracy of the CSL classifier on the Caltech-256 subsets. The
scores in blue are the rates achieved by the CSL classifier. Scores in pink are
from standard multiclass SVM (see text). (Top right) Memory requirements for
CSL with respect to the branch factor parameter. The figure shows that the
parameter does not significantly impact the size of the tree. Also, we can see
a clear difference between the memory usage of CSL and the other
supervised classifiers after training. (Bottom left) Run time for the training
stage. CSL requires roughly an order of magnitude less training run time
than SVM. (Bottom right) Average time to recognize a new image after
training for the different algorithms. The y axis (logarithmic scale) shows CSL
outperforming SVM and KNN by an order of magnitude.


5.1.3. Memory usage

The graph in top right of Figure 5 shows the relationship between
the overall number of nodes in the tree to be retained (and hence
vectors of dimensionality M) and the branch factor for the CSL
classifier. CSL with tree descent had to store an average of 103*125
vectors, while the KNN-on-leaves variant had to store W2.2]
vectors. SVM required 1286 vectors while the vanilla KNN method
(with k = 1) requires storage of the entire training corpus of 2322
vectors. Thus, the number of vectors retained in memory by the
CSL variants is roughly half the number retained by the SVM and
KNN algorithms. Further, the memory needed to store the trained
model when we predict using the KNN-on-leaves approach is
smaller than when we use tree descent, as we expected and
discussed earlier. As can be seen, there is not much variation in CSL
performance across different branch factor values. This suggests
that after a few initial splits, most of the subtrees have very few
categories represented within them and hence the upper bound
on the branch factor does not play a significant role in ongoing
performance.

5.1.4. Classifier run times

The runtime costs of the algorithms paint an even more startling
picture. The graph in bottom left of Figure 5 shows the plots
comparing the training times of the CSL and SVM algorithms.
The two variants of CSL have the same training procedure and
hence require the same time to train. (KNN has no explicit
training stage.) As can be seen, the training time of the new algorithm
(average of 2.42 s) is roughly an order of magnitude smaller than
that of the SVM (average of 15.54 s). It should be clearly noted
that comparisons between implementations of algorithms will not
necessarily reflect underlying computational costs inherent to the
algorithms, for which further analysis and formal treatment will be
required. Nonetheless, in the present experiments, the empirical
costs were radically different despite efforts to show the SVM in
its best light.

As indicated earlier, the choice of branch factor does not have a
large impact on the training time needed. We also found that the
working memory requirements of our algorithm were very small
compared to that of the SVM. In the extreme, when large
representations were used for images, the memory requirements for SVMs
rendered the task entirely impracticable. In such circumstances,
the CSL method still performed effectively. The working amount
of memory we need is proportional to the largest clustering job
that needs to be performed. By choosing low values of $K_{max}$, we
empirically find that we can keep this requirement low without
loss of classifier performance.

The bottom right plot of Figure 5 shows how the average
time for recognizing a new image varies with branch factor.
The times are shown in logarithmic scale. The CSL variants are
an order of magnitude faster than the KNN and SVM algorithms,
with the tree descent variant being the fastest. This shows the
proposed algorithm in its best light. Once training is complete,
recognition can be extremely rapid by doing hierarchical descent,
making the CSL method unusually well suited for real-time
applications.

5.2.  HAXBY fMRI DATASET, 2001

5.2.1. Dataset

Having demonstrated the CSL system on image data, we selected
a very different dataset to test: neuroimaging data collected from
the brain activity of human subjects who were viewing pictures.
As with the Caltech-256 data, we selected a very well-studied set
of fMRI data, from a 2001 study by Haxby et al. (2001).

Six healthy human volunteers entered an fMRI neuroimaging
apparatus and viewed a set of pictures while their brain activity
(blood oxygen-level dependent measures) was recorded. In each
run, the subjects passively viewed gray scale images of eight object
categories, grouped in 24 s blocks separated by rest periods. Each
image was shown for 500 ms and was followed by a 1500-ms
inter-stimulus interval. Each subject carried out twelve of these runs.
The stimuli viewed by the subjects consisted of images from the
following eight classes: Faces, Cats, Chairs, Scissors, Houses,
Bottles, Shoes, and random scrambled pictures. Full-brain fMRI data
were recorded with a volume repetition time of 2.5 s; thus, a
stimulus block was covered by roughly 9 volumes. For a complete
description of the experimental design and fMRI acquisition
parameters, see Haxby et al. (2001). (The dataset is publicly available.)
Each fMRI recording corresponding to 1 volume in a block for
a given input image can be thought of as a vector with 163,840
dimensions. The recordings for all the subjects have the same
vector length. (In the original work, "masks" for individual brain
areas were provided, retaining only those voxels that were
hypothesized by the experimenters to play a significant role in object
recognition. Using these masks reduces the data
dimensionality by a large factor. However, the masks are of different lengths
for different subjects, thus preventing meaningful aggregation of
recordings across subjects. Thus, we have not used the masks and
instead trained the classifiers in the original high dimensional
space.)


5.2.2. Testing on individual subjects

For each subject who participated in the experiment, we have
neuroimaging data collected as that subject viewed images from each
of the eight classes. The task was to see whether, from the brain data
alone, the algorithms could predict what type of picture the subject
was viewing. Top left in Figure 6 shows the prediction accuracy
of the various classifiers we tried. On the whole, all the classifiers
exhibit similar performance, with SVM performing slightly better
on a couple of the subjects.

Top right of Figure 6 shows the memory requirements for all
the algorithms. The CSL variants require significantly less
memory to store the model learned during training compared to SVM
and KNN. SVM requires a large number of support vectors to
fully differentiate the data from different classes, leading to large
memory consumption, whereas KNN needs to store all the
training data in memory. For CSL, if the testing method is tree descent,
then the entire hierarchy needs to be kept in memory. For the
KNN-on-leaves testing method, only the leaves of the tree are
retained, rendering an even smaller memory requirement for the
stored model.

Bottom left of Figure 6 shows the training time for the CSL
algorithm being an order of magnitude smaller than that of SVM.
KNN does not have any explicit training stage. Finally, bottom
right of Figure 6 compares the recognition time for the
different algorithms, again on a log scale. The average recognition time
on a new sample for the CSL tree descent variant is a couple of
orders of magnitude smaller than both KNN and SVM. For the
KNN-on-leaves variant of the CSL method, the recognition time
grows larger (while still being significantly smaller than KNN or
SVM). Therefore the fastest approach is performing a tree descent
(paying a penalty in terms of memory requirements for storing
the model).

5.2.3. Aggregating data across subjects

Since the recordings from all the subjects have the same
dimensionality, we can merge all the data from the different subjects
into 1 large dataset and partition it into the training and
testing datasets. This way we can study the performance trends with
increasing datasets. The SVM system, unfortunately, was unable
to run on pools containing more than two subjects, due to the
SVM system's high memory requirements. Nonetheless the two
variants of the CSL algorithm, and the KNN algorithm, ran
successfully on collections containing up to five subjects' aggregated
data.

The subplot on the left of Figure 7 shows that the classification
prediction accuracies of the different classifiers remain competitive
with each other as we increase the pool. The subplot on the right
of Figure 7 shows the trend of memory consumption by the
different algorithms as we increase the number of subjects included.
Compared to standard KNN, the increase in memory
consumption is much slower (sublinear) for the CSL algorithm, with the
KNN-on-leaves variant of the CSL algorithm growing very slowly.

Finally, in Figure 8 we examine the growth in the average
recognition time with increasing pool size. The costs of adding data
cause the recognition time to grow for the KNN algorithm more
than for either variant of the CSL algorithm (either tree descent
or KNN-on-leaves versions).
Between these two CSL algorithm variants, the latter exhibits
some modest time growth as data is added, whereas the former
(tree descent) version of CSL exhibits no significant increase in
recognition time whatsoever as more data is added to the task.
It is notable that the reason for this is that the tree depth has
not increased with increasing size of the dataset; that is, as more
data is added, the learned CSL tree arrives at the ability to
successfully classify the data early on, and adding new data does
not require the method to add more to the tree. Interestingly,
the trees become better balanced as we increase the number of
subjects, but their sizes do not increase. The results suggest that
the CSL algorithm is better suited to scale to extremely large
data sets than either of the competing standard SVM or KNN
methods.


6.  ANALYSES AND EXTENSIONS

6.1.  ALGORITHM COMPLEXITY

When k-means is used for clustering, the time complexity for
each partitioning is $O(NtK)$, where $N$ is the number of samples,
$K$ is the number of partitions and $t$ is the number of
iterations. If we fix $t$ to be a constant (by putting an upper limit
on it), then each split takes $O(NK)$. Since we also put a bound
on $K$ ($K_{max}$), we can assume that each split is $O(N)$. Further
analysis is needed on the total number of paths and their
contribution to runtime. The maximum amount of memory needed
is for the first unsupervised partitioning. This is proportional
to $O(NK)$. When we have small $K$, the amount of memory is
directly proportional to the number of data elements being used
in training.

As mentioned earlier, the algorithm is intrinsically highly
parallel. After every unsupervised partitioning, each of the partitions
can be further treated in parallel. However, in the experiments
reported here, we have as yet made no attempt to parallelize the
code, seeking instead to compare the algorithm directly against
current standard SVM implementations.

6.2.  COMPARISON WITH OTHER HIERARCHICAL LEARNING TECHNIQUES

The structure of the algorithm makes it very similar to CART (and
in particular, decision trees; Buntine, 1992), since both families of
algorithms partition the non-linear input space into discontinuous
regions such that the individual sub-regions themselves provide
effective class boundaries. However, there are several significant
differences.

• Perhaps the most substantial difference is that decision trees use the labels of the data to perform splits, whereas the CSL algorithm partitions based on unsupervised similarity.

• The CSL algorithm splits in a multivariate fashion, taking into account all the dimensions of the data samples, as opposed to decision trees where, most often, a single dimension that results in the largest demixing of the data is used to make splits. The path from the root to a leaf in a decision tree is a conjunction of local decisions on feature values and as a result is prone to overfitting. As discussed before, the CSL tends to exhibit little overfitting, and we can understand why this is the case (see Discussion in the Simplified Algorithm section earlier). The leaves can be treated independently of the rest of the tree and KNN can be used on them to obtain the class predictions.

• Decision trees are by nature a two-class discriminative approach (multiclass problems can be handled using binary decision trees; Lee and Oh, 2003) whereas the CSL algorithm is a natural multiclass generative algorithm.

Most importantly, the goals of these systems differ. The primary goal of the CSL algorithm is to uncover natural structure within the data. The fact that the label-based impurity of classes is reduced, resulting in the ability to classify labeled data, falls out as a (very valuable) side effect of the procedure. The CSL algorithm thus will carry out a range of additional tasks, beyond supervised classification, that use deeper analysis of the underlying structure of the data, not apparent through supervised labeling alone.

6.3. DISCOVERY OF STRUCTURE

For purposes of this paper we have focused solely on the classification abilities of the algorithm, though the algorithm can perform many other tasks outside the purview of classification. Here we will briefly cover two illustrative additional abilities: (i) uncovering secondary structure of data, and (ii) localization of objects within images.

6.3.1. Haxby dataset

Once a model is trained, for each training sample, if we do hierarchical descent and aggregate the posterior probabilities of the nodes along the path, we get a representation for the sample. When we do an agglomerative clustering on that representation, we uncover secondary structure suggesting meta classes occurring in the dataset. Figure 9 captures the output of such an agglomerative clustering for the recording of one subject (S1). Here we can see extensive structure relations among the responses to various pictures; perhaps most prominent is a clear separation of the data into animate and inanimate classes. The tree suggests the structure of information that is present in the neuroimaging data; the subjects' brain responses distinguish among the different types of pictures that they viewed. Related results were shown by Hanson et al. (2004); these were arrived at by analysis of the hidden node activity of a back propagation network trained on the same data. In contrast, it is worth noting that the CSL classifier obtains this structure as a natural byproduct of the tree-building process.
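The procedure just described can be sketched in a few lines of Python. This is a toy illustration under stated assumptions rather than the analysis behind Figure 9: the per-node posterior vectors, the class names, and the choice of average linkage with Euclidean distance are all hypothetical, since the text does not specify the linkage used. Each representation is formed by concatenating the posteriors met along the descent path, and the representations are then merged bottom-up.

```python
def path_representation(posteriors_along_path):
    """Concatenate the per-node posterior vectors met during descent."""
    return [p for node in posteriors_along_path for p in node]


def agglomerate(vectors, labels):
    """Average-linkage agglomerative clustering; returns the merge history."""
    clusters = {i: [i] for i in range(len(vectors))}
    names = dict(enumerate(labels))

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(vectors[a], vectors[b])) ** 0.5

    def linkage(ca, cb):
        pairs = [(a, b) for a in clusters[ca] for b in clusters[cb]]
        return sum(dist(a, b) for a, b in pairs) / len(pairs)

    merges = []
    next_id = len(vectors)
    while len(clusters) > 1:
        ids = sorted(clusters)
        candidates = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
        # Merge the pair of clusters with the smallest average distance.
        ca, cb = min(candidates, key=lambda pair: linkage(*pair))
        names[next_id] = (names.pop(ca), names.pop(cb))
        clusters[next_id] = clusters.pop(ca) + clusters.pop(cb)
        merges.append(names[next_id])
        next_id += 1
    return merges


# Hypothetical two-node paths: each node contributes a two-way posterior.
paths = {
    "face": [[0.90, 0.10], [0.80, 0.20]],
    "cat": [[0.85, 0.15], [0.75, 0.25]],
    "house": [[0.10, 0.90], [0.20, 0.80]],
    "chair": [[0.20, 0.80], [0.30, 0.70]],
}
labels = list(paths)
vectors = [path_representation(paths[name]) for name in labels]
merges = agglomerate(vectors, labels)
```

On these toy vectors the two animate items merge first, the two inanimate items second, and the two groups last, mirroring the kind of animate/inanimate separation reported in the text.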

A task quite outside the realm of supervised classification is that of localizing, i.e., finding an object of interest within an image. This task is useful to illustrate additional capabilities of the algorithm beyond just classification, making use of the internal representations it constructs.

We assume for this example that the clustering component of the algorithm is carried out by a generative method such as PLSA (Sivic et al., 2005); we then can assume that the features specific to the object class will contribute to the way in which an image becomes clustered, and that those features will contribute more than will random background features in the image.

Figure 10 shows an example of localization of a face within an image. The initial task was to classify images of faces, cars, motorcycles, and airplanes (from Caltech 4). PLSA was used for clustering and for determining the cluster membership of a previously unseen image $x$. For every cluster $z$, we can obtain the posterior probability $p(z \mid x, w)$ for every feature $w$ in the vocabulary.


FIGURE 10 | An illustration of object localization in an image from Caltech-256. (A) The original image. (B–E) Positive, neutral, and negative features (green, blue, and red, respectively) shown at levels 1 through 4 along a path in the CSL tree. (F) A thresholded map of the aggregate feature scores.



Thus, we can test all features in the image to see which ones maximize the posterior, indicating strong influence on the eventual cluster membership. The location of those features can then be used to identify the vicinity of the object.

As the path from root to leaf in the CSL hierarchy is traversed for a particular test image, the posterior at a given node determines the contribution of the feature to the branch selected. Let $y$ be the final object label prediction for image $x$.

Consider feature $f_i$ from the vocabulary. At any given node at height $l$ along the path leading to prediction of $y$ for $x$, let $d_i^l$ be the branch predicted by feature $f_i$, i.e., among all branches at node $l$, the posterior for that branch is highest for that feature. $d_i^l$ is actually a set of labels that can be reached at various leaves using the branch. Finally, let the overall branch taken at $l$ be $b^l$.

At level $l$, $f_i$ can be classified as positive if $\mathbf{1}(d_i^l = b^l)$, neutral if $\mathbf{1}(d_i^l \neq b^l)$ and $\mathbf{1}(y \in d_i^l)$, and finally, negative if $\mathbf{1}(d_i^l \neq b^l)$ and $\mathbf{1}(y \notin d_i^l)$. The overall score for $f_i$ is a weighted sum $s_i$ of all the scores (negative features getting a negative score) along the path. Since we know the locations of the features, we can transfer the scores to actual locations on the images (more than one location may map to the same feature in the vocabulary). When a simple threshold is applied, we get the map seen in the final image. The window most likely to contain the object can then be obtained by optimization of the scores on the map using branch and bound techniques.
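The scoring rule above can be written out as a short sketch. The Python code below is illustrative only and rests on several assumptions of ours: neutral features are given a score of zero, the per-level weights are supplied by the caller, and the threshold keeps only scores whose magnitude exceeds it. The function names and the toy tree are hypothetical.

```python
def level_sign(d, b, reachable, y):
    """+1 if the feature's branch d equals the taken branch b; otherwise
    0 if label y is reachable through d (neutral), else -1 (negative)."""
    if d == b:
        return 1
    return 0 if y in reachable[d] else -1


def feature_score(feature_branches, taken_branches, reachable_per_level, y, weights):
    """Weighted sum of the per-level signs along the root-to-leaf path."""
    return sum(
        w * level_sign(d, b, reachable, y)
        for d, b, reachable, w in zip(
            feature_branches, taken_branches, reachable_per_level, weights
        )
    )


def score_map(height, width, detections, scores, threshold=0.0):
    """Transfer feature scores to image locations, dropping weak scores."""
    grid = [[0.0] * width for _ in range(height)]
    for row, col, feature_id in detections:
        s = scores[feature_id]
        if abs(s) > threshold:
            grid[row][col] += s
    return grid


# Hypothetical two-level path for an image predicted as y = "face".
# reachable_per_level[l][branch] is the set of labels reachable via branch.
reachable_per_level = [
    {0: {"face", "car"}, 1: {"face", "plane"}, 2: {"plane"}},
    {0: {"face"}, 1: {"car"}},
]
taken = [0, 0]
weights = [1.0, 0.5]
scores = {
    "f1": feature_score([0, 0], taken, reachable_per_level, "face", weights),
    "f2": feature_score([1, 1], taken, reachable_per_level, "face", weights),
    "f3": feature_score([2, 1], taken, reachable_per_level, "face", weights),
}
detections = [(0, 0, "f1"), (0, 1, "f1"), (1, 1, "f2"), (1, 2, "f3")]
grid = score_map(2, 3, detections, scores, threshold=0.6)
```

Here `f1` agrees with the path at both levels, `f2` is neutral at the first level and negative at the second, and `f3` is negative at both. The resulting grid is the kind of score map over which a window search would then be run.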

7.  CONCLUSION 

We have introduced a novel, biologically derived algorithm that carries out similarity-based hierarchical clustering combined with simple matching, thus determining when nodes in the tree are to be iteratively deepened. The clustering mechanism is a reduced subset of published hypotheses of thalamocortical function; the match/mismatch operation is a reduced subset of proposed basal ganglia operation; both are described in Granger (2006). The resulting algorithm performs a range of tasks, including identifying natural underlying structure among objects in the dataset; these abilities of the algorithm confer a range of application capabilities beyond traditional classifiers. In the present paper we described in detail just one circumscribed behavior of the algorithm: its ability to use its combination of unsupervised clustering and reinforcement to carry out the task of supervised classification. The experiments reported here suggest the algorithm's performance is comparable to that of SVMs on this task, yet requires only a fraction of the resources of SVM or KNN methods.

It is worth briefly noting that the intent of the research described here has not been to design novel algorithms, but rather to educe algorithms that may be at play in brain circuitry. The two brain structures referenced here, neocortex and basal ganglia, when studied in isolation, have given rise to hypothesized operations of hierarchical clustering and of reinforcement learning, respectively (e.g., Sutton and Barto, 1998; Rodriguez et al., 2004). These structures are connected in a loop, such that (striatal) reinforcement learning can be hypothesized to selectively interact with (thalamocortical) hierarchies being constructed. We conjecture that the result is a novel composite algorithm (CSL), which can be thought of as iteratively constructing rich representations of sampled data.

Though the algorithm was conceived and derived from analysis of cortico-striatal circuitry, the next aim was to responsibly analyze its efficacy and costs and compare it appropriately against other competing algorithms in various domains. Thus we intentionally produced very general algorithmic statements of the derived cortico-striatal operations, precisely so that (1) we can retain functional equivalency with the referenced prior literature (Schultz et al., 1997; Suri and Schultz, 2001; Schultz, 2002; Rodriguez et al., 2004; Daw and Doya, 2006); and (2) the derived algorithm can be responsibly compared directly against other algorithms. The algorithm can be applied to a number of tasks; for purposes of the present paper we have focused on supervised classification (though we also briefly demonstrated the utility of the method for different tasks, including identification of structure in data, and localization of objects in an image).

It is not yet known what tasks or algorithms are actually being carried out by brain structures. Brain circuits may represent compromises among multiple functions, and thus may not outperform engineering approaches to particular specialized tasks (such as classification). In the present instance, individual components are hypothesized to perform distinct algorithms, hierarchical clustering and reinforcement learning, and the interactions between those components perform still another composite algorithm, the CSL method presented here. (And, as mentioned, the studied operations are very-reduced subsets of the larger hypothesized operations of these thalamocortical and basal ganglia systems; it is hoped that ongoing study will yield further algorithms arising from richer interactions of these cortical and striatal structures, beyond the reduced simplifications studied in the present paper.) As might be expected of a method that has been selectively developed in biological systems over evolutionary time, these component operations may represent compromises among differential selectional pressures for a range of competing tasks, carried out by combined efforts of multiple distinct engines of the brain. This represents an instance in which models of a biological system lead to derivation of tractable algorithms for real-world tasks. Since the biologically derived method studied here substantially outperforms extant engineering methods in terms of efficacy per time or space cost, we forward the conjecture that brain circuitry may continue to provide a valuable resource from which to mine novel algorithms for challenging computational tasks.

ACKNOWLEDGMENTS 

This work was supported in part by grants from the Office of Naval Research and the Defense Advanced Research Projects Agency.


REFERENCES 

Alexander, G., and DeLong, M. (1985). Microstimulation of the primate neostriatum. I. Physiological properties of striatal microexcitable zones. J. Neurophysiol. 53, 1401-1416.

Arthur, D., and Vassilvitskii, S. (2007). "k-Means++: the advantages of careful seeding," in SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans.

Barron, A., and Cover, T. (1991). Minimum complexity density estimation. IEEE Trans. Inf. Theory 37, 1034-1054.

Bishop, C. (1996). Neural Networks for Pattern Recognition. New York: Oxford University Press.




Blei, D., Ng, A., and Jordan, M. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993-1022.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Belmont, CA: Wadsworth.

Brown, J., Bullock, D., and Grossberg, S. (1999). How the basal ganglia use parallel excitatory and inhibitory learning pathways to selectively respond to unexpected rewarding cues. J. Neurosci. 19, 10502-10511.

Buntine, W. (1992). Learning classification trees. Stat. Comput. 2, 63-73.

Chang, C., and Lin, C. (2001). LIBSVM: A Library for Support Vector Machines. Available at: http://www.csie.ntu.edu.tw/~cjlin/libsvm

Darken, C., and Moody, J. (1990). "Fast, adaptive k-means clustering: some empirical results," in Proceedings of the IEEE IJCNN Conference (San Diego: IEEE Press).

Daw, N. (2003). Reinforcement Learning Models of the Dopamine System and Their Behavioral Implications. Ph.D. Thesis, Carnegie Mellon University, Pittsburgh, PA.

Daw, N., and Doya, K. (2006). The computational neurobiology of learning and reward. Curr. Opin. Neurobiol. 16, 199-204.

Felch, A., and Granger, R. (2008). The hypergeometric connectivity hypothesis: divergent performance of brain circuits with different synaptic connectivity distributions. Brain Res. 1202, 3-13.

George, D., and Hawkins, J. (2009). Towards a mathematical theory of cortical microcircuits. PLoS Comput. Biol. 5, e1000532. doi: 10.1371/journal.pcbi.1000532

Granger, R. (2006). Engines of the brain: the computational instruction set of human cognition. AI Mag. 27, 15-32.

Granger, R. (2011). How Brains Are Built: Principles of Computational Neuroscience. Cerebrum; The Dana Foundation. Available at: http://dana.org/news/cerebrum/detail.aspx?id=30356

Griffin, G., Holub, A., and Perona, P. (2007). Caltech-256 Object Category Dataset. California Institute of Technology.

Gurney, K., Prescott, T., and Redgrave, P. (2001). A computational model of action selection in the basal ganglia: I. A new functional anatomy. Biol. Cybern. 84, 401-410.

Hanson, S., Matsuka, T., and Haxby, J. (2004). Combinatorial codes in ventral temporal lobe for object recognition: Haxby (2001) revisited: is there a "face" area? Neuroimage 23, 156-166.

Haxby, J., Gobbini, M., Furey, M., Ishai, A., Schouten, J., and Pietrini, P. (2001). Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293, 2425-2430.

Hofmann, T. (1999). "Probabilistic latent semantic indexing," in SIGIR '99: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Berkeley, 50-57.

Houk, J., Bastianen, C., Fansler, D., Fishbach, A., Fraser, D., Reber, P., Roy, S., and Simo, L. (2007). Action selection and refinement in subcortical loops through basal ganglia and cerebellum. Philos. Trans. R. Soc. Lond. B Biol. Sci. 362, 1573-1583.

Hsu, C., and Lin, C. (2002). A comparison of methods for multiclass support vector machines. IEEE Trans. Neural Netw. 13, 415-425.

Kaski, S. (1997). Data exploration using self-organizing maps. Acta Polytech. Scand. Math. Comput. Manag. Eng. Ser. 82.

Kemp, J., and Powell, T. (1971). The connexions of the striatum and globus pallidus: synthesis and speculation. Philos. Trans. R. Soc. Lond. B Biol. Sci. 262, 441-457.

Kosko, B. (1991). Stochastic competitive learning. IEEE Trans. Neural Netw. 2, 522-529.

Leblois, A., Boraud, T., Meissner, W., Bergman, H., and Hansel, D. (2006). Competition between feedback loops underlies normal and pathological dynamics in the basal ganglia. J. Neurosci. 26, 3567-3583.

Lee, J., and Oh, I. (2003). "Binary classification trees for multi-class classification problems," in ICDAR '03: Proceedings of the Seventh International Conference on Document Analysis and Recognition, Edinburgh.

Lee, T., and Mumford, D. (2003). Hierarchical Bayesian inference in the visual cortex. J. Opt. Soc. Am. A 20, 1434-1448.

Lloyd, S. (1982). Least squares quantization in PCM. IEEE Trans. Inf. Theory 28, 129-137.

Lowe, D. (2004). Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60, 91-110.

Marr, D. (1980). Vision. MIT Press.

McGeorge, A., and Faull, R. (1989). The organization of the projection from the cerebral cortex to the striatum in the rat. Neuroscience 29, 503-537.

Ng, A., and Jordan, M. (2002). On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. Adv. Neural Inf. Process. Syst. 14, 841-848.

O'Doherty, J., Dayan, P., Friston, K., Critchley, H., and Dolan, R. (2003). Temporal difference models and reward-related learning in the human brain. Neuron 38, 329-337.

Riesenhuber, M., and Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nat. Neurosci. 2, 1019-1025.

Rodriguez, A., Whitson, J., and Granger, R. (2004). Derivation and analysis of basic computational operations of thalamocortical circuits. J. Cogn. Neurosci. 16, 856-877.

Schultz, W. (2002). Getting formal with dopamine and reward. Neuron 36, 241-263.

Schultz, W., Dayan, P., and Montague, R. (1997). A neural substrate of prediction and reward. Science 275, 1593-1599.

Sivic, J., Russell, B., Efros, A., Zisserman, A., and Freeman, W. (2005). Discovering objects and their location in images. IEEE Int. Conf. Comput. Vis. 1, 370-377.

Sivic, J., and Zisserman, A. (2003). "Video Google: a text retrieval approach to object matching in videos," in ICCV '03: Proceedings of the Ninth IEEE International Conference on Computer Vision, Nice.

Stephan, H. (1972). "Evolution of primate brains: a comparative anatomical approach," in Functional and Evolutionary Biology of Primates, ed. R. Tuttle (Chicago: Aldine-Atherton), 155-174.

Stephan, H., Bauchot, R., and Andy, O. (1970). Data on size of the brain and of various brain parts in insectivores and primates. Advances in Primatology, 289-297.

Stephan, H., Frahm, H., and Baron, G. (1981). New and revised data on volumes of brain structures in insectivores and primates. Folia Primatol. 35, 1-29.

Suri, R., and Schultz, W. (2001). Temporal difference model reproduces anticipatory neural activity. Neural Comput. 13, 841-862.

Sutton, R., and Barto, A. (1990). "Time-derivative models of Pavlovian reinforcement," in Learning and Computational Neuroscience: Foundations of Adaptive Networks, eds M. Gabriel and J. Moore (MIT Press), 497-537.

Sutton, R., and Barto, A. (1998). Reinforcement Learning: An Introduction. MIT Press.

Teh, Y., Jordan, M., Beal, M., and Blei, D. (2004). "Sharing clusters among related groups: hierarchical Dirichlet processes," in Proceedings of Neural Information Processing Systems, Vancouver.

Ullman, S. (2006). Object recognition and segmentation by a fragment-based hierarchy. Trends Cogn. Sci. (Regul. Ed.) 11, 58-64.

Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.

Conflict of Interest Statement: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Jikceiwd:-  27  A*an-  JOJ  P:  aocepad:  23 
Grater  201  li  pubhihsf  snhn£-  .'7  ,hxn- 
uiry  207Z 

Citation: Chandrashekar A and Granger R (2012) Derivation of a novel efficient supervised learning algorithm from cortical-subcortical loops. Front. Comput. Neurosci. 5:50. doi: 10.3389/fncom.2011.00050
7lYi  iimdr  was  ruhmnaf  ru  fluuneo  In 
AenanK1  rnsspum'icns  of  icrriic -srcLirii! 
[Hym  a  jpePorg--  of  FfmteTS  ?□  Gnupu- 
ran  fflsul  Afeira  denoc. 

Copyright © 2012 Chandrashekar and Granger. This is an open-access article distributed under the terms of the Creative Commons Attribution Non Commercial License, which permits non-commercial use, distribution, and reproduction in other forums, provided the original authors and source are credited.



Appendix  B 


T"];  tItthhj  tli  Fpram  hrogi  nn't^nrig  T. abitr  jtihry 
BFI.AE  T*ChfllCil  Tt iTp h ] t t  Sill  2-^ 


Learning what is where from unlabeled images:
Joint localization and clustering of foreground objects


Ashok Chandrashekar · Lorenzo Torresani · Richard Granger


Abstract "What does it mean, to see? The plain man's answer would be, to know what is where by looking." This famous quote by David Marr [21] sums up the holy grail of vision: discovering what is present in the world, and where it is, from unlabeled images. In this paper we tackle this challenging problem by proposing a generative model of object formation and present an efficient algorithm to automatically learn the parameters of the model from a collection of unlabeled images. Our algorithm discovers the objects and their spatial extents by clustering together images containing similar foregrounds. Our approach simultaneously solves for the image clusters, the foreground appearance models, and the spatial regions containing the objects by optimizing a single likelihood function defined over the entire image collection. We describe two novel methods for efficient foreground localization: the first method does not require any bottom-up image segmentation and discovers the foreground region as a contiguous rectangular bounding box. The second method expresses the foreground as a collection of super-pixels generated through a bottom-up segmentation of the image. However, unlike previous methods, objects are not assumed to be encapsulated by a single segment. Evaluation on standard benchmarks and comparison with prior methods demonstrate that our approach achieves state-of-the-art results on the problem of unsupervised foreground localization and clustering.

1 Introduction

Object categorization requires recognizing the classes of objects appearing in an input photo. Rather than performing classification of the entire image as a whole, object class recognition systems often operate by decomposing the photo into different regions corresponding to the objects present in the scene. Treating object localization and recognition jointly allows such methods to be more robust to clutter, variations in backgrounds, as well as presence of multiple objects.

We can distinguish several methodologies for object recognition and localization on the basis of the amount of human supervision needed during training. When the training images are manually segmented into semantic regions, object localization can be formulated as the task of densely matching regions of the input photo to the manually annotated segments of similar images in the database [19]. In order to achieve good results, these methods require very large collections of annotated images so as to maximize the chance of a close image match in the database. However, due to the cost of collecting pixel labels, such datasets are extremely time-consuming to generate and difficult to label accurately.

A second methodology involves the use of datasets where only the object of interest is manually segmented in the training images. Typically, recognition and localization is then achieved using a combination of bottom-up segmentation and top-down classification ([5], [13], [23], [30]). But these methods are computationally expensive to run and, again, the requirement for detailed segmentation in the training set is far too onerous.

An efficient alternative is object detection ([7], [3]), which involves sliding a subwindow classifier exhaustively over all rectangular regions of the test image in order to robustly localize the box that is most likely to contain the object. This brute-force evaluation can be made very efficient by using a branch and bound strategy [16], which allows a large portion of regions to be rapidly removed from consideration. These algorithms normally require the object to be delineated using a bounding box in the training dataset, which is easier to generate compared to full segmentation. However, even this form of labeling is expensive to acquire and effectively restricts the size of the training set. Furthermore, the sizes and locations of the bounding boxes are typically chosen arbitrarily by the labeler and are consequently unlikely to be optimal for recognition.
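As a rough illustration of the exhaustive subwindow search described above, the Python sketch below scores every axis-aligned rectangle of a small score map and returns the best one. It assumes, for simplicity, that the classifier response of a window is the sum of per-pixel scores inside it, so that an integral image gives each window score in constant time; the function names and the toy map are hypothetical. A branch and bound strategy would aim to reach the same optimum while evaluating far fewer rectangles; it is not implemented here.

```python
def integral_image(score):
    """Return ii with ii[r][c] = sum of score over rows < r and columns < c."""
    h, w = len(score), len(score[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0.0
        for c in range(w):
            row_sum += score[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum over rows top..bottom-1 and columns left..right-1."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def best_window(score):
    """Brute-force search over all rectangles; returns (sum, box)."""
    ii = integral_image(score)
    h, w = len(score), len(score[0])
    best_sum, best_box = float("-inf"), None
    for top in range(h):
        for bottom in range(top + 1, h + 1):
            for left in range(w):
                for right in range(left + 1, w + 1):
                    s = box_sum(ii, top, left, bottom, right)
                    if s > best_sum:
                        best_sum, best_box = s, (top, left, bottom, right)
    return best_sum, best_box


# Toy score map: positive evidence in the centre, negative elsewhere.
score = [
    [-1, -1, -1, -1],
    [-1, 2, 3, -1],
    [-1, 1, 2, -1],
    [-1, -1, -1, -1],
]
best_sum, best_box = best_window(score)
```

The brute-force loop visits on the order of $H^2 W^2$ rectangles for an $H \times W$ map, which is the cost that the branch and bound strategy is intended to avoid.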

When images have labels indicating the objects present in them but no locality information for the objects, semi-supervised methods can be applied to learn automatically the correspondences between image regions and the labels of the image. Most methods in this genre use bottom-up segmentation as a preprocessing step to produce candidate segments, and then perform top-down learning on the segments ([10], [2], [6]). However, the main weakness of such methods is their reliance on the ill-defined task of bottom-up segmentation (based on low-level visual cues such as edges and texture) to segment images such that objects or semantically coherent regions are represented by a single segment. Thus, such approaches typically yield poor classification accuracy. Recently, Nguyen et al. [22] and Deselaers et al. [9] have proposed weakly-supervised object localization methods avoiding the need for bottom-up segmentation: the idea of these methods is to simultaneously localize discriminative subwindows in the training images and to learn a classifier to recognize such regions. However, even such methods require supervision in terms of class labels.

In this paper we contrast the traditional methodologies for object localization and recognition outlined above, by presenting a fully-unsupervised method which completely eliminates the need for time-consuming and suboptimal human labeling. The intuition behind our approach is that objects can be viewed as recurring foreground patterns appearing as coherent image regions. Thus, we can formulate object discovery as the task of partitioning an unlabeled collection of images into K subsets (clusters), such that all images within each subset share a similar foreground. In order to obtain a method scalable to large collections and many classes, we adopt a foreground mask-based representation of objects, which enables fast localization given the object model. Specifically, we represent the object in an image as a histogram of quantized local features occurring in the enclosing foreground mask. We view each object instance as a random variable drawn from an unknown distribution common to all instances of that object class. This common distribution assumption constrains all foreground histograms of an object class to represent subtle variations around a prototypical average histogram. Based on this assumption, our approach poses object discovery as a maximum likelihood estimation problem, to be optimized over the entire collection of unlabeled images. We present a method that maximizes this objective by simultaneously solving for the histogram model parameters of the object classes, detecting the object instances of each class in the unlabeled images, and performing a soft semantic clustering of images in the data set. In the next section we review prior methods for unsupervised object discovery and discuss their relation to our approach.
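To make the representation concrete, the following Python sketch builds a foreground histogram of quantized features inside a rectangular mask and computes soft cluster memberships for it. This is only a sketch under our own assumptions: it takes the common class distribution to be a multinomial over visual words with strictly positive probabilities and equal class priors, which this section does not specify, and the function names and toy data are hypothetical.

```python
import math


def foreground_histogram(features, box, vocab_size):
    """Count visual words located inside box = (top, left, bottom, right)."""
    top, left, bottom, right = box
    hist = [0] * vocab_size
    for row, col, word in features:
        if top <= row < bottom and left <= col < right:
            hist[word] += 1
    return hist


def log_likelihood(hist, prototype):
    """Multinomial log-likelihood of hist, up to an additive constant.

    Assumes every word with a nonzero count has nonzero prototype mass.
    """
    return sum(n * math.log(p) for n, p in zip(hist, prototype) if n)


def soft_assignments(hist, prototypes):
    """Posterior cluster memberships under equal priors (an assumption)."""
    logs = [log_likelihood(hist, p) for p in prototypes]
    peak = max(logs)
    weights = [math.exp(value - peak) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


# Toy image: (row, col, visual word) triples and a candidate foreground box.
features = [(1, 1, 0), (1, 2, 0), (2, 2, 1), (5, 5, 2)]
hist = foreground_histogram(features, (0, 0, 4, 4), vocab_size=3)

# Two hypothetical class prototypes over a three-word vocabulary.
prototypes = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
memberships = soft_assignments(hist, prototypes)
```

In a full procedure of the kind described above, the box, the prototypes, and the memberships would be re-estimated jointly to increase the likelihood over the whole collection; the sketch evaluates only a single image against fixed prototypes.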


2  Related  work 

Class-generic methods for object discovery, such as [1] and [14], attempt to
discover image regions which are strong candidates for containing objects.
These methods operate on individual images in a purely bottom-up fashion.
However, the bottom-up notion of 'objectness' is ill-defined, and hence methods
which discover objects from a collection of images by determining statistically
reoccurring image fragments are more likely to succeed at the task.

Lee and Grauman [17] have proposed an approach to automatically localise
foreground features from a collection of unlabeled images. By learning the
'significance' weights of semi-local features iteratively through image grouping,
their method determines for each image which features are most relevant, given the
image content in the remainder of the collection. While this work successfully
demonstrates that a mutual reinforcement of object-level and feature-level
similarity improves unsupervised image clustering, there is no clear way of
translating feature weights into foreground localisation and object extents.
Furthermore, it performs clustering from pairwise image matches and therefore the
computational cost at each iteration is cubic in number of images. Finally, the
algorithm alternates between image clustering and updating the foreground weights
without a unifying formal objective and thus its convergence properties are unclear.


Various semantic topic models ([11], [3?], [f-2], [15], [9]) have been proposed
for similar tasks where the location of the object is treated as a latent variable
to be estimated. However, most of these methods are not fully unsupervised and
often resort to an expensive sliding window mechanism for object discovery with
unknown costs for detection.


Our work is inspired by the approach of Russell et al. [24], who extend their
earlier work [26] and propose a fully unsupervised algorithm to discover objects
and associated segments from a large collection of images. Multiple segmentations
are performed for each image by varying the parameters of a segmentation
method. The key assumption is that each object instance is correctly segmented
(as a single contiguous segment) at least once through multiple segmentation and
therefore the correct segments corresponding to object classes occur more often
than random background. This suggests that the features of correct segments form
object-specific coherent clusters discoverable using latent topic models from text
analysis. Although the algorithm is shown to be able to discover many different
objects, it still suffers from its dependence on bottom-up segmentation to come up
with a single segment encapsulating the object, which is ill-posed particularly in
the case of unsupervised datasets, since it is often necessary to know the category
of the object in order to reliably segment it from the scene. Their goal is different
from ours in that their method does not prescribe a way to cluster the images or
determine which regions in the images correspond to image foregrounds.
Nevertheless, in the experiments we consider adaptations of their method to our task
for a quantitative comparison.


In contrast, we propose a generative model of foreground formation that
enables simultaneous image clustering and foreground localization via maximum
likelihood estimation. Unlike [24], our approach treats each image as a composition
of foreground and background, where the foreground is explained by a single model
shared with other images and the background is image-specific and hence not
modeled. We treat the foreground mask as a parameter to be estimated as part of
the likelihood optimization. We demonstrate that this leads to better localization
and image clustering. Apart from the proposed unified framework of maximum
likelihood estimation for the task, the main contributions of this paper are the
development of two novel methods for efficient localization of object foregrounds
in images. In the first method, the foreground is encapsulated by a rectangular
bounding box, thus obviating the need for bottom-up segmentation. The second
method does rely on bottom-up segmentation. However, the segments generated
are assumed to be nothing more than 'super-pixels'. In particular, we do not
assume that the foreground is captured by a single segment. Hence, we overcome
most of the drawbacks of previous methods which tend to generalize poorly due to
their reliance on the assumption that bottom-up segmentation will likely produce
object instance segments consistently across images belonging to the same class.

Fig. 1: Our generative model of image formation: image $z_n$ is obtained by first
drawing its object class ($l_n$); then the appearance of the object inside the
foreground location ($x_n$) is generated from a distribution ($\theta$) common to
all object instances of that class. The background model ($\theta^B_n$) is assumed
to change with every image.


3 Generative model for unsupervised object discovery


We now describe our proposed generative model for unsupervised object discovery.
We assume we are given as input a collection of $N$ unlabeled images
$z_1, \ldots, z_N$, with each image containing one of $K$ objects. Our objective is
twofold: to separate the images into $K$ disjoint subsets (clusters) corresponding
to the $K$ object classes and to localise the object within each image. We denote
with $x_n$ the unknown foreground mask enclosing the foreground object of image
$z_n$. We represent the foreground region $x_n$ of image $z_n$ by computing the
un-normalized histogram $h(z_n, x_n) \in \mathbb{N}^W$ of the visual words (i.e.,
quantized local visual features) occurring inside $x_n$, where $W$ represents the
number of unique words in the visual codebook, which, as usual, is learned during
an offline stage from training images. We assume that the foreground histograms of
images belonging to the $k$-th object class are generated from a common model
defined by parameters $\theta_k$. Specifically, let $l_n \in \{1, \ldots, K\}$
denote the unknown cluster label of image $z_n$, which we assume to be drawn from a
Multinomial distribution with parameters $\pi = \{\pi_1, \ldots, \pi_K\}$. Then, we
model the foreground histogram $h(z_n, x_n)$ as a random variable drawn from a
Gaussian distribution with parameters $\theta_k = \{\mu_k, \Sigma_k\}$:
$p(h(z_n, x_n) \mid l_n = k, \theta) = \mathcal{N}(h(z_n, x_n); \mu_k, \Sigma_k)$.
In order to reduce the number of parameters to be estimated, we assume the
covariance $\Sigma_k$ of each cluster $k$ to be diagonal:
$\Sigma_k = \mathrm{diag}(\lambda_{k1}, \ldots, \lambda_{kW})$. Finally, each image
is assumed to have its own independent background model defined by parameters
$\theta^B_n$. For our objective of object discovery, the background parameters can
be left unresolved. The complete generative model is summarized graphically in
Figure 1. We propose to maximize the likelihood of this model by marginalizing over
the cluster labels, which we treat as hidden variables. In other words, our
objective is to find parameters $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ and
foreground regions $x = \{x_1, \ldots, x_N\}$ maximizing

$$p(z \mid x, \theta)\, p(x) = \prod_{n=1}^{N} p(z_n \mid x_n, \theta)\, p(x_n)
  = \prod_{n=1}^{N} \sum_{k=1}^{K} p(z_n, l_n = k \mid x_n, \theta)\, p(x_n) \qquad (1)$$

where $p(x_n)$ is a prior penalizing unlikely configurations of the foreground mask.
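As an illustration of the objective above, the following minimal sketch evaluates
the logarithm of the penalized likelihood for fixed foreground masks. It assumes
that, because the background is left unmodeled, each mixture term reduces to
$\pi_k \mathcal{N}(h(z_n, x_n); \mu_k, \Sigma_k)$ with diagonal $\Sigma_k$. The
function names and the list-based data layout are hypothetical and are not taken
from the paper.

```python
import math


def diag_gauss_logpdf(h, mu, lam):
    """Log-density at h of a Gaussian with mean mu and covariance diag(lam)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * l) + (x - m) ** 2 / l)
        for x, m, l in zip(h, mu, lam)
    )


def penalized_log_likelihood(hists, log_priors, pi, mus, lams):
    """Sum over images of log sum_k pi_k N(h_n; mu_k, diag(lam_k)) + log p(x_n).

    hists[n]      : foreground histogram of image n under its current mask
    log_priors[n] : log p(x_n) for the current mask of image n
    pi, mus, lams : mixture weights, means and diagonal variances per cluster
    """
    total = 0.0
    for h, log_prior in zip(hists, log_priors):
        terms = [
            math.log(p) + diag_gauss_logpdf(h, m, l)
            for p, m, l in zip(pi, mus, lams)
        ]
        top = max(terms)  # log-sum-exp trick for numerical stability
        total += top + math.log(sum(math.exp(t - top) for t in terms))
        total += log_prior
    return total
```

Working in the log domain avoids underflow when $W$ is large, since the product of
many per-bin densities quickly becomes smaller than floating-point precision.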


4  Optimization 

We can maximize the proposed penalized likelihood via an Expectation Maximization
(EM) algorithm alternating between estimating the distribution over the cluster
labels $l_n$ and solving for the foreground models and locations. Next, we show
how to perform each of these steps and demonstrate that our modeling choices
lead to efficient localization of the object regions given the foreground parameters
$\theta$. The penalized complete log-likelihood of our model is given by:

$$\begin{aligned}
\mathcal{L} &= \log \prod_{n=1}^{N} p(z_n, l_n \mid x_n, \theta)\, p(x_n) \\
            &= \log \prod_{n=1}^{N} p(z_n \mid l_n, x_n, \theta)\, p(l_n \mid \theta)\, p(x_n) \\
            &= \sum_{n=1}^{N} \big[ \log p(z_n \mid l_n, x_n, \theta) + \log p(l_n \mid \theta) + \log p(x_n) \big] \qquad (2)
\end{aligned}$$

The E-step of the algorithm involves calculating the latent posterior distribution
$\gamma_{nk} = p(l_n = k \mid z_n, x_n, \theta)$ given the current estimates for
$\theta$ and $x$. It can be seen that this reduces to an evaluation of the
following equation:

$$\gamma_{nk} = \frac{\pi_k\, \mathcal{N}(h(z_n, x_n); \mu_k, \Sigma_k)}
                     {\sum_{j=1}^{K} \pi_j\, \mathcal{N}(h(z_n, x_n); \mu_j, \Sigma_j)} \qquad (3)$$
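The E-step can be sketched as follows. This is a minimal illustration assuming
diagonal covariances and histograms stored as plain lists; the names `e_step` and
`log_gauss_diag` are hypothetical, and the computation is done in the log domain
purely for numerical stability.

```python
import math


def log_gauss_diag(h, mu, lam):
    """Log-density at h of a Gaussian with mean mu and covariance diag(lam)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * l) + (x - m) ** 2 / l)
        for x, m, l in zip(h, mu, lam)
    )


def e_step(hists, pi, mus, lams):
    """Return gamma[n][k], the posterior probability of cluster k for image n."""
    gamma = []
    for h in hists:
        logs = [
            math.log(p) + log_gauss_diag(h, m, l)
            for p, m, l in zip(pi, mus, lams)
        ]
        top = max(logs)  # subtract the maximum before exponentiating
        weights = [math.exp(v - top) for v in logs]
        norm = sum(weights)
        gamma.append([w / norm for w in weights])
    return gamma
```

Each row of the returned matrix sums to one, which is the soft cluster assignment
used in the subsequent M-step.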


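The M-step requires maximizing the expected log-likelihood
$\langle \mathcal{L}(\theta) \rangle_\gamma$ with respect to $\theta$ and $x$. We
begin by writing the expected log-likelihood:

$$\begin{aligned}
\langle \mathcal{L} \rangle_\gamma
  &= \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \log \mathcal{N}(h(z_n, x_n); \mu_k, \Sigma_k) \\
  &\quad + \sum_{n=1}^{N} \sum_{k=1}^{K} \gamma_{nk} \log \pi_k
         + \sum_{n=1}^{N} \log p(x_n) + \mathrm{const} \qquad (4)
\end{aligned}$$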


Approved  for  public  release;  distribution  unlimited. 




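The update steps for parameters $\theta$ can be obtained by setting the respective
derivatives to zero. This leads to the following rules:

$$\pi_k = \frac{1}{N} \sum_{n=1}^{N} \gamma_{nk} \qquad (5)$$

$$\mu_k = \frac{1}{\sum_{n=1}^{N} \gamma_{nk}} \sum_{n=1}^{N} \gamma_{nk}\, h(z_n, x_n) \qquad (6)$$

$$\lambda_{kw} = \frac{1}{\sum_{n=1}^{N} \gamma_{nk}} \sum_{n=1}^{N}
   \gamma_{nk} \big( [h(z_n, x_n)]_w - [\mu_k]_w \big)^2 \qquad (7)$$

where $[a]_w$ denotes the $w$-th entry of a vector $a$.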
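The closed-form parameter updates above can be sketched in a few lines. This is a
minimal illustration rather than the authors' implementation: the function name
`m_step` is hypothetical, and the small variance floor `var_floor` is an added
assumption to keep the diagonal covariance invertible when a cluster collapses
onto identical histograms. It also assumes every cluster receives non-zero total
responsibility.

```python
def m_step(hists, gamma, var_floor=1e-6):
    """Update mixture weights, means and diagonal variances from responsibilities.

    hists[n][w] : foreground histogram of image n
    gamma[n][k] : responsibility of cluster k for image n (from the E-step)
    """
    n_images = len(hists)
    n_clusters = len(gamma[0])
    n_words = len(hists[0])
    pi, mus, lams = [], [], []
    for k in range(n_clusters):
        mass = sum(gamma[n][k] for n in range(n_images))  # sum_n gamma_nk
        pi.append(mass / n_images)
        mu = [
            sum(gamma[n][k] * hists[n][w] for n in range(n_images)) / mass
            for w in range(n_words)
        ]
        lam = [
            max(
                var_floor,  # assumption: floor not mentioned in the text
                sum(gamma[n][k] * (hists[n][w] - mu[w]) ** 2
                    for n in range(n_images)) / mass,
            )
            for w in range(n_words)
        ]
        mus.append(mu)
        lams.append(lam)
    return pi, mus, lams
```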

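In the M-step we also need to update the estimate of the foreground mask $x_n$
by solving the following optimization:

$$\begin{aligned}
\arg\max_{x_n} \langle \mathcal{L} \rangle_\gamma
  &= \arg\max_{x_n} \Big\{ \log p(x_n) + \sum_{k=1}^{K} \gamma_{nk}
     \log \mathcal{N}(h(z_n, x_n); \mu_k, \Sigma_k) \Big\} \\
  &= \arg\max_{x_n} \Big\{ \log p(x_n) - \sum_{k=1}^{K} \sum_{w=1}^{W}
     \frac{\gamma_{nk}}{2 \lambda_{kw}} \big( [h(z_n, x_n)]_w - [\mu_k]_w \big)^2 \Big\} \qquad (8)
\end{aligned}$$

We now show that this objective can be rewritten in a form that leads to efficient
optimization. Let $c \in \mathbb{R}^{WK}$ be the vector with entries
$c_{(k-1)W + w} = \gamma_{nk} / (2 \lambda_{kw})$, let
$\mu = (\mu_1^T, \ldots, \mu_K^T)^T \in \mathbb{R}^{WK}$, and finally let us denote
with $\bar{h}(x_n)$ the vector containing $K$ copies of $h(z_n, x_n)$, i.e.,
$\bar{h}(x_n) = [h(z_n, x_n)^T, \ldots, h(z_n, x_n)^T]^T \in \mathbb{R}^{WK}$.
Then, we can rewrite the objective of eq. 8 equivalently as follows:

$$\arg\max_{x_n} \langle \mathcal{L} \rangle_\gamma
  = \arg\max_{x_n} \Big\{ \log p(x_n) - \sum_{j=1}^{KW} c_j
    \big( [\bar{h}(x_n)]_j - [\mu]_j \big)^2 \Big\} \qquad (9)$$

We next introduce methods to optimise this objective efficiently.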
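The rewriting into a single sum over stacked vectors can be checked with a short
sketch. It assumes the stacking order in which the entries of cluster 1 come
first, then those of cluster 2, and so on, and it takes each weight to be the
responsibility divided by twice the corresponding variance, as follows from the
log of a diagonal Gaussian. The helper names are hypothetical.

```python
def stack_terms(gamma_n, mus, lams):
    """Build the stacked weight vector c and mean vector mu (length K*W).

    gamma_n[k] : responsibility of cluster k for the current image
    mus[k][w]  : mean of word w in cluster k
    lams[k][w] : variance of word w in cluster k
    """
    c = [g / (2.0 * l) for g, lam_k in zip(gamma_n, lams) for l in lam_k]
    mu = [m for mu_k in mus for m in mu_k]
    return c, mu


def data_term(h, c, mu, n_clusters):
    """Weighted squared distance between K stacked copies of h and mu."""
    h_bar = list(h) * n_clusters  # K copies of the foreground histogram
    return sum(cj * (hj - mj) ** 2 for cj, hj, mj in zip(c, h_bar, mu))
```

For a candidate mask, `data_term` gives the quantity that is subtracted from the
log-prior of the mask; a smaller value means the foreground histogram is better
explained by the clusters the image is softly assigned to.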


4.1 Image foregrounds as rectangular bounding boxes

A popular way of circumscribing an object in an image is by using rectangular
bounding boxes. Traditionally for the object detection task, the bounding boxes are
determined using an expensive sliding window method (p], [8]). However, recently
Lampert et al. [16] have introduced a branch and bound optimization procedure
to localize bounding boxes efficiently. In our first proposed approach for
determining foreground locality, we treat the foreground of each image as a
contiguous rectangular region which is represented by the variable
$x_n \in \mathcal{X}$. Here $\mathcal{X}$ indicates the space of all rectangular
subwindows. The foreground content $h(z_n, x_n)$ is just a histogram of all
features that occur within the rectangle.

Consider eq. 9: note that the second term in this objective is a weighted
Euclidean distance between $\mu$ and the histogram $\bar{h}(x_n)$ computed from the
visual words in subwindow $x_n$. For such a term, we can define a quality lower
bound function over sets of subwindows as described by Lampert et al. [16]. For
simplicity, let us denote $[\bar{h}(x_n)]_j$ and $[\mu]_j$ as $h(x)_j$ and $\mu_j$
respectively. Let $x^{min}$ and $x^{max}$ be the smallest and largest rectangles in
a candidate set $X \subseteq \mathcal{X}$. We observe that the value of each
histogram bin over a set of rectangles $X$ can be bounded from below and from above
by the number of features with corresponding cluster index that fall into $x^{min}$
and $x^{max}$ respectively. We denote these bounds by $\underline{h}(X)_j$ and
$\overline{h}(X)_j$ respectively. Each summand can now be bounded from below by

$$c_j \big( h(x)_j - \mu_j \big)^2 \ge
\begin{cases}
  c_j \big( \underline{h}(X)_j - \mu_j \big)^2 & \text{if } \mu_j < \underline{h}(X)_j \\
  0 & \text{if } \underline{h}(X)_j \le \mu_j \le \overline{h}(X)_j \\
  c_j \big( \overline{h}(X)_j - \mu_j \big)^2 & \text{if } \mu_j > \overline{h}(X)_j
\end{cases} \qquad (10)$$

In our implementation, we model the first term $p(x_n)$ as a simple 2D Gaussian
over the relative width and height of the foreground subwindow, measured as
fractions of the image width and height. Therefore, the bound over sets of
subwindows can be trivially defined for $\log p(x_n)$. This implies that our
complete objective can now be globally optimised over $x_n \in \mathcal{X}$ using
the branch and bound method for efficient subwindow search of [16].
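The per-bin bound used by branch and bound can be sketched as below. This is only
the bounding function for the weighted squared-distance term, not a full branch
and bound search; the name `distance_lower_bound` is hypothetical, and the inputs
are assumed to be the bin counts of the smallest and largest rectangles of the
candidate set (stacked to the same length as the weight and mean vectors).

```python
def distance_lower_bound(c, mu, h_lo, h_hi):
    """Lower bound of sum_j c_j (h_j - mu_j)^2 over all h with h_lo <= h <= h_hi.

    h_lo[j] : count of bin j inside the smallest rectangle of the set
    h_hi[j] : count of bin j inside the largest rectangle of the set
    """
    total = 0.0
    for cj, mj, lo, hi in zip(c, mu, h_lo, h_hi):
        if mj < lo:
            # every rectangle in the set has at least lo features in this bin
            total += cj * (lo - mj) ** 2
        elif mj > hi:
            # no rectangle in the set can reach mu_j in this bin
            total += cj * (hi - mj) ** 2
        # otherwise some rectangle may match mu_j exactly, so the bound is 0
    return total
```

Because the bound is never larger than the true cost of any rectangle in the set,
a set whose bound already exceeds the best cost found so far can be discarded
without enumerating its members.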

4.2  Image  foregrounds  as  a  set  of  super  pixels 

Modeling foregrounds as rectangular regions forces foregrounds to be rigid and
contiguous. In some cases, this results in the inclusion of random background
clutter as part of the window, which influences the foreground object model. This is
undesirable and is particularly troublesome for highly contoured objects and
object classes with large pose variance. To address this concern, we propose a
second method of representing foregrounds. Here, each image $z_n$ undergoes
bottom-up segmentation once at the start of the clustering procedure and is split
into a number of appearance-based segments $\{s_n^1, \ldots, s_n^M\}$. The number of
segments, $M$, is large enough for the image to be deemed over-segmented. These
segments are called super pixels. Thus, the goal of finding the foreground becomes
equivalent to finding which superpixels may be part of the foreground. An important
property of considering an image as a collection of super pixels is that unlike
[24] and several other approaches, we do not require that the entire foreground
object region be captured by a single bottom-up segment. Instead we treat the
foreground as composed of a group of super pixels.

Formally, the foreground mask $x_n$ from figure 1 is described by a sequence
of variables $\{x_n^1, \ldots, x_n^M\}$. We treat each $x_n^m$ as a variable such
that $x_n^m \in [0, 1]$, with the interpretation that higher values imply that the
super pixel $s_n^m$ is to be part of the foreground region and a value close to 0
implies that $s_n^m$ is assigned as part of the background. Therefore the
foreground image content is defined as
$h(z_n, x_n) = \sum_{m=1}^{M} x_n^m\, h(s_n^m)$, where $h(s_n^m)$ is the histogram
of features occurring in the super pixel $s_n^m$. Using this, we recast eq. 9 as

$$\begin{aligned}
\arg\max_{x_n} \langle \mathcal{L} \rangle_\gamma = \arg\min_{x_n}
  \Big\{ & \sum_{(i,j) \in S_n} \big( x_n^i - x_n^j \big)^2
  + \sum_{j=1}^{KW} c_j \Big( \Big[ \sum_{m=1}^{M} x_n^m\, \bar{h}(s_n^m) \Big]_j - [\mu]_j \Big)^2 \Big\} \\
  & \text{subject to } a \le \sum_{m=1}^{M} \frac{P_m}{P}\, x_n^m \le b \qquad (11)
\end{aligned}$$

where $\bar{h}(s_n^m)$ contains $K$ copies of $h(s_n^m)$, $S_n$ is the set of all
pairs of neighboring segments$^1$ in image $z_n$, $P_m$ is the number of pixels in
segment $s_n^m$, $P = \sum_m P_m$, and $a$, $b$ are scalar values constraining the
size of object foregrounds (in our experiments we set $a$ = 0.5S, $b$ = OLK). The
first term in eq. 11 represents our choice of prior $p(x_n)$. This term penalizes
configurations where neighboring segments have widely differing values and thus
forces foreground segments to be localized together. It can be seen that eq. 11 is
a simple convex optimisation objective when $x_n^m$ is allowed to be a real value
and hence can be minimized very efficiently using quadratic programming.
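The relaxed superpixel selection problem can be illustrated with a small solver.
The sketch below uses projected gradient descent with backtracking instead of a
quadratic programming solver, and it enforces only the box constraint that each
selection variable lies in [0, 1]; the foreground-size constraint involving $a$
and $b$ is omitted. All names are hypothetical, and each row of `seg_hists` is
assumed to be a superpixel histogram already repeated $K$ times so that it has
the same length as `c` and `mu`.

```python
def objective(x, neighbors, seg_hists, c, mu):
    """Smoothness over neighboring superpixels plus weighted squared distance."""
    smooth = sum((x[i] - x[j]) ** 2 for i, j in neighbors)
    data = 0.0
    for j in range(len(mu)):
        h_j = sum(x[m] * seg_hists[m][j] for m in range(len(x)))
        data += c[j] * (h_j - mu[j]) ** 2
    return smooth + data


def gradient(x, neighbors, seg_hists, c, mu):
    """Gradient of the objective with respect to the selection variables x."""
    g = [0.0] * len(x)
    for i, j in neighbors:
        d = 2.0 * (x[i] - x[j])
        g[i] += d
        g[j] -= d
    for j in range(len(mu)):
        r = sum(x[m] * seg_hists[m][j] for m in range(len(x))) - mu[j]
        for m in range(len(x)):
            g[m] += 2.0 * c[j] * r * seg_hists[m][j]
    return g


def select_superpixels(neighbors, seg_hists, c, mu, iters=200):
    """Minimize the relaxed objective over x in [0, 1]^M by projected gradient."""
    x = [0.5] * len(seg_hists)
    for _ in range(iters):
        f0 = objective(x, neighbors, seg_hists, c, mu)
        g = gradient(x, neighbors, seg_hists, c, mu)
        step, accepted = 1.0, False
        while step > 1e-12:
            # gradient step followed by projection onto the box [0, 1]
            cand = [min(1.0, max(0.0, xi - step * gi)) for xi, gi in zip(x, g)]
            slope = sum(gi * (ci - xi) for gi, ci, xi in zip(g, cand, x))
            if objective(cand, neighbors, seg_hists, c, mu) <= f0 + 1e-4 * slope:
                accepted = True
                break
            step *= 0.5  # backtrack until sufficient decrease holds
        if not accepted:
            break
        x = cand
    return x
```

Since the objective is a convex quadratic and the feasible set is a box, any
off-the-shelf QP solver would reach the same minimizer; the hand-written loop is
used here only to keep the example free of external dependencies.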


5  Implementation  details 

5.1 Image representation

Our representation is based on histograms of quantised SIFT features [20]. We
experimented with both SIFT descriptors calculated densely over the entire image
and also those produced using an interest point detector. Similarly to what was
reported by the authors in [17], we obtained better results using dense descriptors
calculated at every pixel in the image. Thus, here we present experiments based
only on dense features. As per common practice, we quantise the SIFT descriptors
using a vocabulary of visual words generated by running k-means on a set of SIFT
descriptors obtained from the collection of input images. We then learn a codebook
of LDA topics [4] over the quantised SIFT features via Gibbs sampling [13].
Therefore, each image is viewed as a document of visual words generated from
a mixture of topics and the final histogram is produced by assigning each
quantized SIFT descriptor to its most likely topic. In our experiments we determined
that histograms over a small LDA codebook provided the most consistent results.
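The histogram construction can be illustrated with a simplified sketch. It covers
only the vector quantisation step (nearest codeword) and the counting of words
inside a rectangular mask; the subsequent mapping from visual words to LDA topics
is not shown. The function names, the inclusive box convention and the toy data
layout are assumptions made for illustration.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry closest to the descriptor (squared L2)."""
    dists = [
        sum((d - w) ** 2 for d, w in zip(descriptor, word))
        for word in codebook
    ]
    return dists.index(min(dists))


def box_histogram(points, words, box, n_words):
    """Un-normalized histogram of word indices located inside a rectangle.

    points[i] : (x, y) location of the i-th local feature
    words[i]  : codeword index assigned to the i-th local feature
    box       : (x0, y0, x1, y1) with inclusive corners
    """
    x0, y0, x1, y1 = box
    hist = [0] * n_words
    for (x, y), w in zip(points, words):
        if x0 <= x <= x1 and y0 <= y <= y1:
            hist[w] += 1
    return hist
```

A typical use is to quantise every dense descriptor once per image and then
recompute only `box_histogram` whenever the candidate foreground mask changes.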


5.2 Initialization

The method of initialization for unsupervised clustering often has a large impact
on the quality of the final results. The parameters to initialize in our model are:
mixture coefficients ($\pi_k$), histogram means ($\mu_k$) and variances
($\lambda_{kw}$), and foreground masks for all images ($x_n$).

We have evaluated bottom-up, class-generic object detectors such as [1] for
suitability for getting an initial estimate of image foregrounds. However, in
practice such methods are unreliable. Therefore, we have developed a novel approach
to initialize object foreground locations. We essentially perform a pairwise
foreground matching for all images in the dataset. For each pair of images, we find
the two foreground masks that minimise the L1-norm distance between histograms
computed from these masks. This can be viewed as a form of co-segmentation [23],
aimed at finding the most similar subwindows in the two images. Specifically, for

each pair of images $(z_i, z_j)$, we find the pair of subwindows
$(x_i, x_j) \in \mathcal{X} \times \mathcal{X}$ that minimizes the following
objective:

$$\min_{(x_i, x_j) \in \mathcal{X} \times \mathcal{X}}
  \| h(z_i, x_i) - h(z_j, x_j) \|_1 - C\, \| h(z_i, x_i) + h(z_j, x_j) \|_1 \qquad (12)$$

where $\| \cdot \|_1$ denotes the L1-norm and $C$ is a hyperparameter trading off
the objectives of finding similar histograms and choosing large subwindows (in our
implementation, $C$ is set to 0.2). It is easy to see that this objective can be
minimized using a simple variant of the branch and bound method described in [16].
At the end of this pairwise matching process, for each image $z_n$ we get $N - 1$
candidate foreground masks. From this set, we pick the 3rd largest window by
area. The intuition behind this choice is that close matches will result in larger
windows and that the largest windows probably contain background regions due
to matching to near-duplicates. The same initial windows were used for both the
foreground localization methods described in this paper. While it is true that the
cost of initialization is quadratic in the number of images, we stress that it is a
one-time cost, unlike most competing methods (such as [17]), where each iteration
has a quadratic cost. Once initialized, the iterations of our EM algorithm have
linear cost.

1 Note that, since the segments do not change after the initial image segmentation,
neither do neighborhood relationships between segments.

For initializing the mixture parameters, we tried a variation of careful seeding
[3] which was robust against outliers.
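The pairwise foreground initialization described above can be sketched in two
small pieces: the cost of a candidate pair of windows and the selection of the
third-largest candidate window. The search over window pairs (branch and bound in
the text) is not shown. The function names and the `(x0, y0, x1, y1)` box
convention are assumptions.

```python
def match_cost(h_i, h_j, C=0.2):
    """L1 distance between two window histograms minus C times their total mass.

    The second term rewards large windows, since bigger windows contain more
    features and therefore a larger histogram mass.
    """
    diff = sum(abs(a - b) for a, b in zip(h_i, h_j))
    mass = sum(abs(a + b) for a, b in zip(h_i, h_j))
    return diff - C * mass


def pick_initial_window(candidates, rank=3):
    """Return the rank-th largest candidate box (x0, y0, x1, y1) by area."""
    def area(box):
        return (box[2] - box[0]) * (box[3] - box[1])

    ordered = sorted(candidates, key=area, reverse=True)
    # fall back to the smallest available window if there are too few candidates
    return ordered[min(rank, len(ordered)) - 1]
```

For an image with $N - 1$ candidate windows (one per partner image),
`pick_initial_window` skips the two largest windows, which according to the text
are the most likely to include background from near-duplicate matches.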


6  Experimental  results 

There are very few published quantitative evaluations on the task of unsupervised
clustering and foreground localization. In this paper, we benchmark the
performance of our proposed method principally against the results published in [17]
(FF), which reports on the same task. We do not compare directly to the methods
described in [29] as these algorithms do not consider the problem of object
localization and instead perform image clustering merely based on global features
calculated from the entire image. Instead we include as baselines a mixture of
Gaussians model applied to whole images (GMM-whole) as well as ground truth bounding
boxes (GMM-GT), to show the benefits provided by our foreground localisation
methods in the clustering results. We have also applied the Gaussian mixture model
on bounding boxes derived using the bottom-up method described in [1] (GMM-Obj).
Finally, we have interpreted the method described in [24] (Multi-Seg) and
applied it to our task.

In [17], the authors have evaluated their method on the MSRC-v1 dataset and
two subsets (a 4-class and a 10-class collection) of the Caltech 101 dataset. Please
refer to that paper for details on the datasets. Here we report our findings using
exactly the same experimental set-up and sets of images. For all datasets, we pick
the number of foreground clusters, $K$, to be equal to the number of classes. We
performed all experiments using a codebook of -SG LDA topics computed from
500 SIFT words. For the segment selection method of foreground localisation, we
generate 20 bottom-up segments for every image using an implementation of
normalized cuts [25].

Fig. 2: The quality of image clustering in terms of the F-Measure metric for the
three datasets. The compared methods are GMM applied to full images (GMM-Whole),
ground truth subwindows (GMM-GT) and object boxes derived using [1] (GMM-Obj).
The plots also include results for Multi-Seg [24], FF [17], and our proposed
algorithm of joint clustering and localisation (subwindow discovery as well as
segment selection).


6.1 Quality of image clustering

We begin by evaluating the quality of clustering as F-measure metric with respect
to the ground truth class labels: $F = \sum_i \frac{N_i}{N} \max_j F(i, j)$, where
$N_i$ is the number of images belonging to class $i$,
$F(i, j) = \frac{2\, R(i, j)\, P(i, j)}{R(i, j) + P(i, j)}$, and $P(i, j)$ and
$R(i, j)$ denote precision and recall, respectively, measured for class $i$ and
cluster $j$. The F-measure is a good index of cluster purity with high values
indicating that each cluster contains objects predominantly from one class.
Figure 2 summarises the results obtained on all three data sets.
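The clustering F-measure can be computed as in the following sketch, which takes
precision as the fraction of a cluster that belongs to the class and recall as the
fraction of the class that falls in the cluster. The function name and the use of
two parallel label lists are assumptions for illustration.

```python
def clustering_f_measure(labels, clusters):
    """Class-size-weighted best F-score of each ground-truth class over clusters.

    labels[n]   : ground-truth class of image n
    clusters[n] : cluster index assigned to image n
    """
    n = len(labels)
    total = 0.0
    for i in set(labels):
        n_i = sum(1 for l in labels if l == i)
        best = 0.0
        for j in set(clusters):
            n_j = sum(1 for c in clusters if c == j)
            n_ij = sum(1 for l, c in zip(labels, clusters) if l == i and c == j)
            if n_ij == 0:
                continue  # F(i, j) is zero when class and cluster do not overlap
            precision = n_ij / n_j
            recall = n_ij / n_i
            best = max(best, 2.0 * precision * recall / (precision + recall))
        total += (n_i / n) * best
    return total
```

A perfect clustering scores 1 regardless of how the cluster indices are numbered,
while putting every image in a single cluster is penalised through low precision.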

The standard Gaussian mixture model (GMM) has been evaluated in different
settings, one of them using whole images (GMM-whole). For the Caltech subsets
where ground truth is available in the form of bounding boxes, we have also tested
the method using only the image content within the ground truth foreground
subwindows (GMM-GT). The generic object detector of [1] provides for each image
the bounding box with the highest probability of corresponding to an object. The
graphs show the result of applying GMM on the image content lying within these
boxes as well. From this figure we see that our approach greatly outperforms
GMM using full images (with the segment selection procedure for foreground
localization marginally outperforming the subwindow discovery method). Furthermore,
somewhat surprisingly, our approach also does much better than clustering applied
to the foreground ground truth subwindows. We speculate that this is because the
manual annotations are subjective and unreliable. Particularly in classes with high
degree of variance, the human-selected boxes might work against the clustering
attempt as the content expressed within the foreground regions of images within the
same class might not be similar. On the other hand, unsurprisingly, the results of
applying GMM-Obj are poor since determining objects from a single still image
is an ill-defined task.

However, we chiefly compare against the results of the "foreground focus" (FF)
method described in [17] and the multiple segmentations (Multi-Seg) method of
[24]. Significantly, our system also outperforms the results reported in [17] despite
their algorithm using a sophisticated semi-local representation encoding relative
location of features in spatial neighborhoods. The difference in performance is
especially noticeable on the most challenging MSRC-v1 data set, which contains
objects at different scales and in different positions within the image.


6.2 Foreground initialization

Table 1: F-measure for different initialisation methods (numeric entries not
recoverable from the source).


We have evaluated several different ways of initializing the foreground masks
for the images. In particular, we report results for 3 different methods:
initializing masks to be full images, initializing masks to be the most probable
region in the image to contain an object as determined by a class-generic object
detector [1], and finally initializing foregrounds using the pairwise image
cosegmentation method described in the previous section. A summary of the results
in terms of clustering quality is given in Table 1. From the table it is clear that
the cosegmentation approach, despite the high cost, provides the best results
consistently.


6.3  Foreground  Localization 

We  now  proceed  to  evaluate  our  approach  in  terms  of  object  localization  ^cu¬ 
racy.  In  |11]T  the  authors  determine  the  quality  of  the  foreground  Localization  by 
examining  the  normalized  sum  of  the  weights  inside  the  ground  truth  foreground. 
While their performance on this metric does indicate that the foreground features
get higher weight than background features, there is no clear way of determining
the actual locality and extent of the foregrounds in the images. Furthermore,
with that metric, it is possible to get a high score by having just a few very
highly weighted foreground features. Instead, it is useful for many applications to
determine the actual location and size of the foreground. Our algorithm generates
a natural solution to this requirement in the form of bounding boxes for the
foreground, when the localisation method is subwindow discovery, and foreground
segments (footnote 2), when the localisation method is segment selection. We
measure the quality of the foreground localisation by using a metric commonly used
in object detection: $J_n = \mathrm{area}(x_n \cap x_n^{GT}) / \mathrm{area}(x_n \cup x_n^{GT})$,
where $x_n$ is the predicted foreground and $x_n^{GT}$ is the ground truth
for the object in image $n$. For the subwindow discovery localisation method, we use
the bounding box ground truth provided for the images in Caltech 101, and for the
segment selection method, we use the object contour ground truth provided.
Figure 3 shows the localisation scores achieved with our method on all images of
the 4-class and the 10-class subsets of Caltech 101. We also include the localisation
scores achieved by [24]. It is clear that the foreground localisation by the segment
selection method is superior to subwindow discovery; however, it does not make a
significant difference in terms of F-measure scores. We speculate that this may be
due to the restricted nature of the dataset, with highly correlated background
content appearing in the discovered subwindows, which aids in clustering foregrounds
correctly. While studying the scores, we want to emphasize that these are calculated
with respect to the manually annotated ground truth. As we have already
seen in the case of bounding boxes, they are somewhat arbitrary. In our method,
foreground detection is optimized for image clustering. So it is reasonable to get
foregrounds which are inconsistent with the ground truth, but nevertheless play a
role in improving image clustering.

Fig. 3: Average localisation scores achieved by our methods on all images from
each ground truth class in the 4-class and the 10-class subsets of Caltech 101. We
also show the localisation scores achieved by Multi-Seg. Please see text for more
details.
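The localisation metric above is the standard intersection-over-union ratio. A minimal
sketch for the bounding-box case follows; the `Box` layout and the function names are
illustrative assumptions, not part of the authors' software, and the segment-selection
case would compare pixel masks instead of rectangles.

```python
from typing import Tuple

# Assumed box layout (hypothetical): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_area(box: Box) -> float:
    """Area of an axis-aligned box; degenerate or inverted boxes count as 0."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def localisation_score(pred: Box, truth: Box) -> float:
    """Intersection area divided by union area of prediction and ground truth."""
    overlap = (
        max(pred[0], truth[0]),
        max(pred[1], truth[1]),
        min(pred[2], truth[2]),
        min(pred[3], truth[3]),
    )
    inter = box_area(overlap)
    union = box_area(pred) + box_area(truth) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Two 2x2 boxes overlapping in a 1x1 square: 1 / (4 + 4 - 1) = 1/7.
    print(localisation_score((0, 0, 2, 2), (1, 1, 3, 3)))
```

A score of 1 means the prediction coincides with the annotation, and 0 means no
overlap, so a box that merely contains a few highly weighted features is penalised
for missing the rest of the object.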


A brief note on our interpretation of the methods in [34]: We point out that
this method was designed for a different task: it does not explicitly cluster the
images or specify which segments are foregrounds. Nevertheless, we tried adapting
this method to work on our task in two different ways: (a) We ran the code of [21]:
for each image I, multiple segmentations were computed and a topic model was
fit to the segments. Cluster membership was determined as the topic (T_I) of the
segment (S_best) with the smallest KL divergence to its topic. Then, to localise the
foreground, we selected all segments having T_I as the most probable topic from
the segmentation containing S_best. (b) We used the superpixels of our method
as input to [24] and then applied the procedure described in (a) for clustering
and localisation (we also tried using the most frequently occurring topic as cluster
membership criterion, with no improvement in performance). We have included
the results for (a) in the plots in Figures 2 and 3. The results for (b) are very
similar. In short, both the cases yielded much lower accuracy than our approach.
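The selection rule in (a) can be sketched as follows. This is only an illustration
under stated assumptions: each segment is summarised by a normalised visual-word
histogram, each topic by a word distribution, and the divergence is taken from the
segment histogram to its topic's distribution (the text does not state the
direction). All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def best_segment(
    segment_hists: List[Sequence[float]],
    segment_topics: List[int],
    topic_dists: List[Sequence[float]],
) -> Tuple[int, int]:
    """Return (index of the segment closest to its own topic, that topic).

    The returned topic plays the role of the image's cluster label.
    """
    scores = [
        kl_divergence(hist, topic_dists[topic])
        for hist, topic in zip(segment_hists, segment_topics)
    ]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, segment_topics[best]


if __name__ == "__main__":
    topics = [[0.9, 0.1], [0.1, 0.9]]                # two toy topics over two words
    hists = [[0.5, 0.5], [0.88, 0.12], [0.3, 0.7]]   # three toy segments
    assigned = [0, 0, 1]                             # most probable topic per segment
    print(best_segment(hists, assigned, topics))
```

Localisation would then keep every segment whose most probable topic equals the
returned topic, restricted to the segmentation that contains the selected segment.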


Finally, our algorithms are quite fast thanks to the very efficient foreground
localisation methods. For instance, on the Caltech 4 subset, the EM approach based on
branch-and-bound completes all its iterations in 300 seconds whereas the
segment-selection method runs in 40 seconds. Figure 4 shows some examples of foreground
prediction for our method both in terms of discovered subwindows and selected
bottom-up segments. Please refer to supplemental data for more visualisations.


Footnote 2: superpixels with high foreground scores. We deem a superpixel to be part
of the foreground at the end of the EM run if its score exceeds 0.3.


Fig. 4: (a): Examples of foreground prediction in images from the 10-class subset
of Caltech 101. The image on the left of each pair shows the superpixels obtained
through bottom-up segmentation. The box in blue and the contour in red are the
ground truth for the object location in the image. The image on the right of each
pair shows the foreground discovered as a collection of superpixels (selected if
their foreground score exceeds 0.3) by the segment selection method of localisation.
The box in green is the foreground extent predicted by the subwindow discovery
method. (b): Sample results for the MSRC-v1 dataset.


7 Conclusions

Unsupervised foreground discovery is an important but difficult means of extracting
structure from large unlabeled image datasets. In this work, we have developed
a probabilistic method to perform simultaneous image clustering and foreground
localization in unlabeled collections. We have shown that harnessing the natural
synergy between the two tasks leads to improved performance at both the tasks. In
the process, we have formulated two novel methods for discovering and representing
object foregrounds by associating and efficiently estimating latent variables
corresponding to bounding boxes and image segment sets. Our method can efficiently
localise object foregrounds without resorting to expensive sliding window
mechanisms. We note that our assumption that each image contains one of K
objects and the simplicity of our appearance model allow us to cast foreground
clustering and localisation elegantly as a single joint optimisation, something that
has never been done until now. Furthermore, we empirically show that the approach
outperforms methods that make more complex assumptions but that then have
to resort to alternation between distinct objectives (e.g., [11]) or to a two-step
solution (e.g., [24]) to solve the problem. We believe there is high value in simple
models shown to perform well in practice. In the future we are interested in
extending the work to videos, where the task is a natural fit. Our probabilistic
formulation also enables straightforward integration of non-visual cues such as text
or tags associated to the images, which may yield more semantically meaningful
clusters. The software implementing our algorithm will be made available upon
publication.

How Brains Are Built: Principles of Computational Neuroscience
By  Richard  Granger,  Ph.D. 




Editor's note: The goal of computational neuroscience is to understand the brain and its mechanisms well
enough to artificially simulate their functions. In some areas, like hearing, vision, and prosthetics, there
have been great advances in the field. Yet there is still much about the brain that is unknown and therefore
cannot be artificially replicated: How does the brain use language, make complex associations, or
organize learned experiences? Once the neural pathways responsible for these and many other functions
are fully understood and reconstructed, we will have the ability to build systems that can match, and
maybe even exceed, the brain's capabilities.


Article available online at hTTTT

"If I cannot build it, I do not understand it." So said Nobel laureate Richard Feynman, and by his
metric, we understand a bit about physics, less about chemistry, and almost nothing about biology.1

When we fully understand a phenomenon, we can specify its entire sequence of events, causes,
and effects so completely that it is possible to fully simulate it with all its internal mechanisms intact.
Achieving that level of understanding is rare. It is commensurate with constructing a full design for a
machine that could serve as a stand-in for the thing being studied. To understand a phenomenon
sufficiently to fully simulate it is to understand it computationally.

"Computation" does not refer to computers per se; rather it refers to the underlying principles and
methods that make them work. As Turing Award recipient Edsger Dijkstra said, computational science "is
no more about computers than astronomy is about telescopes." Computational science is the study of the
hidden rules underlying complex phenomena from physics to psychology.

Computational neuroscience, then, has the aim of understanding brains sufficiently well to be
able to simulate their functions, thereby subsuming the twin goals of science and engineering: deeply
understanding the inner workings of our brains, and being able to construct simulacra of them. As simple
robots today substitute for human physical abilities, in settings from factories to hospitals, so brain
engineering will construct stand-ins for our mental abilities, and possibly even enable us to fix our brains
when they break.

Brains and Their Construction

Brains, at one level, consist of ion channels, chemical pumps, specialized proteins. At another
level, they contain several types of neurons connected via synaptic junctions. These are in turn composed
into networks consisting of repeating modules of carefully arranged circuits. These networks are arrayed
in interacting brain structures and systems, each with distinct internal wiring and each carrying out
distinct functions. As in most complex systems, each level arises from those below it but is not readily
reducible to its constituents. Our understanding of an organism depends on our understanding of its
component organs, but also on the ongoing interactions among those parts, as is evident in differentiating
a living organism from a dead one.

For instance, kidneys serve primarily to separate and excrete toxins from blood and to regulate
chemical balances and blood pressure, so a kidney simulacrum would entail a nearly complete set of
chemical and enzymatic reactions. A brain also monitors many critical regulatory mechanisms, and a
complete understanding of it will include detailed chemical and biophysical characteristics.

But brains, alone among organs, produce thought, learning, recognition. No amount of
engineering has yet equaled, let alone surpassed, brains' abilities at these tasks. Despite huge efforts and
large budgets, we have no artificial systems that rival humans at recognizing faces, nor understanding
natural languages, nor learning from experience.

There are, then, crucial principles that brains encode that have so far eluded the best efforts of
scientists and engineers to decode. Much of computational neuroscience is aimed directly at attempting to
decipher these principles.

Today we cannot yet fully simulate every aspect of a kidney, but we have passed a decisive
threshold: we can build systems that replicate kidney principles so closely that they can supplant their
function in patients who have suffered kidney loss or damage. Artificial kidneys do not use the same
substrate as real kidneys; circuits and microfluidics take the place of cells and tissue, yet they carry out
operations that are equivalent, and lifesaving, to the human bodies that use them. A primary long-term
goal of computational neuroscience is to derive scientific principles of brain operation that will catalyze
the comparable development of prosthetic brains and brain parts.

Do We Know Enough About Brains to Build Them?

As with any complex system, in the absence of full computational understanding of the brain, we
proceed by collecting constraints: experimentally observable data can rule out potential explanations. The
more we can rule out, the closer we are to hypotheses that can account for the facts. Many constraining
observations have usefully narrowed our understanding of how mental activity arises from brain circuitry;
these can be organized into five key categories.

Brain component allometry: Remarkably tight relationships hold between a brain's overall size and the
size of its constituent components. Just knowing the overall brain size of any mammal, we can with great
precision predict the size of all component structures within the brain. Thus, with few exceptions, brains
apparently do not and cannot choose which structures to differentially expand or reconfigure. So, quite
surprisingly, rather than a range of different circuits, or even selective resizing of brain components,
human brains are instead largely built from the same components as other mammalian brains, in the same
circuit layouts, with highly predictable relative sizes. Apparently a quantitative change (brain size) results
in a qualitative one (uniquely human computational capabilities).

Telencephalic uniformity: Circuits throughout the forebrain (telencephalon) exhibit notably similar
repeated designs, with few exceptions, including some slightly different cell types, circuit
structures, and genes. Yet brain areas purported to underlie unique human abilities (e.g., language) barely
differ from other structures; there are no extant hypotheses of how the modest observed genetic or
anatomical differences could engender exceedingly different functions. Taken together, these findings
intimate the existence of a few elemental core computational functions that are re-used for a broad range
of apparently different sensory and cognitive operations.

Anatomical and physiological imprecision: Evidence suggests that neural components are surprisingly
sloppy (probabilistic) in their operation, very sparsely connected, low-precision, and extraordinarily
slow, despite exhibiting careful timing under experimental conditions. Either brains are far
more precise than we yet understand or else they carry out families of algorithms whereby precise
computations arise from imprecise components. If so, this greatly constrains the types of operations
that any brain circuits could be engaged in.


Task specification: Though artificial telephone operators field phone inquiries with impressive voice
recognition, we know that they could do far better. The only reason we know this is that human operators
substantially outperform them: there are no other formal specifications whatsoever that characterize the
voice recognition task. Engineers began by believing that they understood the task sufficiently to
construct artificial operators. It has turned out that their specification of the task does not match the
actual, still highly elusive set of steps that humans actually perform in recognizing speech. Without
formal task specifications, the only way to equal human performance may be to come to understand the
brain mechanisms that give rise to the behavior.

Parallel processing: Some recognition tasks take barely a few hundred milliseconds, corresponding
to no more than hundreds of serial neural steps (of milliseconds each), strongly indicating myriad neurons
acting in parallel, imposing a very strong constraint on the types of operations that individual neurons
could be carrying out. Yet parallelism in computer science, even on a small scale, such as two or three
simultaneous operations, has proven very elusive. Why, for instance, don't our dual-core or quad-core
computers run two or four times faster than single-core systems? The (painfully direct) answer is that we
simply do not yet know how to divide most software into parts that can effectively exploit the presence of
these additional hardware elements. Even for readily parallelizable software, it is challenging to design
hardware that yields scalable returns as processors are added. It is increasingly possible that principles
of brain architecture may help identify novel and powerful parallel machine designs.
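One common way to quantify why extra cores give less than proportional speedup is
Amdahl's law, a textbook model that the article does not itself name and that is
brought in here only as an illustration. It assumes a fixed fraction of a program
can be parallelized perfectly while the rest stays serial; the function name and
the example fractions are hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Ideal speedup when only `parallel_fraction` of the work can be split up."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if n_processors < 1:
        raise ValueError("n_processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_processors)


if __name__ == "__main__":
    # Illustrative case: half of the program is parallelizable.
    for cores in (1, 2, 4):
        print(cores, round(amdahl_speedup(0.5, cores), 3))
```

Under this model a program that is only half parallelizable runs about 1.33 times
faster on two cores and 1.6 times faster on four, far short of the two-fold and
four-fold gains one might naively expect.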

From  Circuits  to  Algorithms  to  Prosthetics 

There are several promising instances in which different laboratories (even laboratories that are
competing with each other) have arrived at substantial points of agreement about what certain brain areas
are likely doing. A notable success story arises from studies of the basal ganglia, which takes two kinds of
input: sensory information from the neocortex, and "reward" and "punishment" information arising from
external stimuli. We are close to computationally understanding this large region of the brain, which
apparently carries out just one of our primary learning abilities: our slow "trial and error" learning
(studied in computational neuroscience as "reinforcement learning"), underlying our ability to acquire
such skills as riding a bike.
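As a rough illustration of slow, reward-driven learning, the sketch below nudges a
value estimate for each action toward the reward it produces. It is a generic
textbook-style update with made-up action names and rewards, assumed here for
illustration only; it is not a model of the basal ganglia.

```python
from typing import Dict


def update_value(value: float, reward: float, learning_rate: float = 0.1) -> float:
    """Move the current estimate a small step toward the observed reward."""
    return value + learning_rate * (reward - value)


def learn_by_trial_and_error(
    rewards: Dict[str, float], trials: int = 50, learning_rate: float = 0.1
) -> Dict[str, float]:
    """Try every action once per trial and update its estimated value."""
    values = {action: 0.0 for action in rewards}
    for _ in range(trials):
        for action, reward in rewards.items():
            values[action] = update_value(values[action], reward, learning_rate)
    return values


if __name__ == "__main__":
    # Hypothetical rewards: staying upright is rewarded, falling is punished.
    outcome = {"lean into the turn": 1.0, "lean away from the turn": -1.0}
    print(learn_by_trial_and_error(outcome))
```

With a small learning rate the estimates approach the true rewards only after many
repetitions, which mirrors the gradual character of skill acquisition described above.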

In addition, there is a growing consensus that circuits in the neocortex, by far the largest set of
brain structures in humans, carry out another, quite different kind of learning: the ability to rapidly learn
new facts and to organize newly acquired knowledge into vast hierarchical structures that encode complex
relationships, such as categories and subcategories, episodes, and relations.2

And these two systems are connected to each other, via far-reaching cortico-basal ganglia (aka
cortico-striatal) loops. The basal ganglia system carries out the computational operations of skill
learning (reinforcement learning) while cortical circuits computationally construct vast hierarchies of
facts and relationships among facts. Interestingly, computational research on reinforcement learning has found
that adding hierarchies to the process can greatly improve learning performance. Our ancestors
(reptiles and early mammals) were largely driven by the basal ganglia, whereas mammalian evolution has
hugely expanded the relative size of the neocortex. By consistently increasing the size ratio of the
neocortex to the basal ganglia, mammalian brain evolution may be solving a specific computational
problem. Our understanding of human and animal learning abilities is being advanced by these
computational studies, and we are developing novel methods for machine learning, enabling more
powerful computer algorithms for analysis of complex data ranging from medical to commercial to
financial applications.

Meanwhile, as study of these primary cortico-striatal brain structures remains very much still in
progress, great advances have been made in deep, computational understanding of certain circumscribed
brain systems, in particular those involved in early sensory transduction and perception. The results have
been striking.

Analysis of cochlear mechanisms has led to the construction of prosthetics that serve today as
cures for more than 100,000 people who have lost their hearing. Retinal prosthetics are in advanced
development. In a recent study, patients with retinal implants recognized printed letters of size and
distance comparable to reading a book in relatively low light. And experimental prosthetic arms can
respond to brain-initiated control; people learn to control the arm simply by deciding to move it.

These sensory and motor findings have also led to formalizations of the general problem of acting
in environments that are only partly observable and are dynamically changing, such as robotics or
automated navigation: the result is a set of increasingly impressive robotic methods that see and navigate
in complex surroundings. In a series of trials run by the Department of Defense over the last several
years, vehicles were, for the first time, able to navigate through real urban traffic, merging, passing,
parking, and negotiating intersections, with no human control. Retinal algorithms operate equally well on
other sensors such as radar; and prosthetic limb algorithms are wholly applicable to robots. Many of the
algorithms that operate robots and automated vehicles are closely related to those that operate prosthetic
limbs.

As we come to computationally understand how these peripheral sensorimotor systems work, the
distinction between natural and artificial is being eroded. A breed of robots that share many of our own
dexterity and perceptual abilities is likely to emerge directly from this research. As these increasingly
biologically-based robots, or biots, come to replace human skilled labor, the economic and social
consequences may be substantial.

From  Percept  to  Concept 

The primary differences between human brains and those of other animals lie not in our sensory
or motor mechanisms, which are largely shared across many species, but rather in cognitive abilities:
association, representation, reasoning. Despite great advances in peripheral prosthetics, there is no
commensurate understanding of advanced cognition.

The abilities of peripheral circuits (retina, cochlea, initial thalamic and cortical regions) are
largely built in at birth via genetic programs and shaped in early childhood during developmentally
critical periods. In contrast, the rest of the neocortex will use those built-in systems to acquire masses of
specific information about the environment over a lifetime. Neocortical circuits are not born with
knowledge of particular scenes, faces, or actions; these are acquired through sensorimotor experience:
observing and interacting with objects and events in our surroundings. Cortical circuits are engaged
almost entirely in fact learning: rapid, permanent acquisition and organization of everyday occurrences.
The low-level biological mechanisms underpinning long-term fact learning (permanent, anatomical
synaptic changes, rather than inherently ephemeral chemical changes) are becoming understood. But the
neocortex is not just a passive warehouse of billions of isolated facts; we can arbitrarily associate them,
recall them, embellish them.31 Association, recall, retrieval, organization (all that we can actually do with
memory) depend on mechanisms that are as yet still unknown.

Early cortical areas, then, deal with recognizing objects (even in different lighting settings and
clutter), but some laboratories are increasingly focusing on cortical circuits that are beyond the early
sensory areas: the vast remainder of the neocortex that somehow encodes sequences, associations, and
abstract relations.

Seeing a phone, we perceive not only its visual form but also its affordances (calling, texting,
photographing, playing music), our memories of it (when we got it, where we have recently used it), and a

Approved  for  public  release;  distribution  unlimited. 


Cerebrum,  January  2011 


wealth of potential associations (our ringtone, whom we might call, whether it is charged, etc.). The
questions of how cross-modal information is learned and integrated, and in what form the knowledge is
stored (how percepts become concepts) now constitute the primary frontier of work in computational
neuroscience. In this borderland between perception and cognition, the peripheral language of the senses
is transmuted to the internal lingua franca of the brain, freed from literal sensation and formulated into
internal representations that can include a wealth of associations.

Even our simplest perceptions often rely on top-down processing: using stored memory
representations to inform our ongoing perception and recognition. In some circumstances, we can
recognize objects in just tens of milliseconds, so rapidly that it is unlikely that any top-down pathways
are yet engaged. Yet once we're beyond simple recognition, to the far richer range of inference,
association, and even language, memories strongly influence our perceptions. Merely thinking of a car is
sufficient to activate the same early visual areas that would have been triggered by actually seeing the car,
including its shape, size, color, and other features.32

These early visual areas are just one instance of the spread of activation from a triggering
memory. Thinking of a car may also activate many other areas, as yet largely unmapped, that
encode knowledge of how to open car doors, turn ignition keys, steer, accelerate, brake, or information
about what particular car you own, where it is parked, and so on. Today we can experimentally test for
visual shape information because we know a great deal about how to decode neural responses that occur
in early visual areas, but we have comparatively modest data for other associative knowledge.
Computational models of spreading activation are now striving to make contact with specific neural
mechanisms and brain pathways, to arrive at convergent hypotheses like those of peripheral sensory
systems.
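To make the idea of spreading activation concrete, here is a minimal sketch on a toy
association graph. The graph, the association strengths, the decay factor, and the
threshold are all illustrative assumptions and do not correspond to any specific
published model or to measured brain data.

```python
from typing import Dict

# Hypothetical association graph: node -> {neighbour: strength in [0, 1]}.
Graph = Dict[str, Dict[str, float]]


def spread_activation(
    graph: Graph, source: str, decay: float = 0.5, threshold: float = 0.05
) -> Dict[str, float]:
    """Propagate activation outward from `source`, weakening at every hop.

    Each node ends up with the strongest activation that reaches it along any
    path; activation at or below `threshold` is not propagated further.
    """
    activation = {source: 1.0}
    frontier = [source]
    while frontier:
        node = frontier.pop()
        for neighbour, strength in graph.get(node, {}).items():
            incoming = activation[node] * strength * decay
            if incoming > threshold and incoming > activation.get(neighbour, 0.0):
                activation[neighbour] = incoming
                frontier.append(neighbour)
    return activation


if __name__ == "__main__":
    associations: Graph = {
        "car": {"steering": 0.8, "parking spot": 0.6},
        "steering": {"braking": 0.5},
        "piano": {"music": 0.9},
    }
    print(spread_activation(associations, "car"))
```

In this toy example, thinking of "car" strongly activates "steering", weakly activates
"braking" through it, and leaves the unrelated "piano" cluster untouched.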

Computing Individual Differences: From Neurotypes to Cognotypes

Though all of us have extraordinarily similar brains, even small differences can be striking.
Whether particular characteristics are genetic, developmental, or learned is still often impossible to
ascertain, but individual behavioral differences are highly likely to directly correspond to individual brain
differences, whether genetic or acquired. Most work in computational neuroscience, from perception to
cognition, from anatomy to computational models, has focused on one agent at a time, one brain at a
time. A further frontier will be to confront differences among individuals.

Our  bodies  are  built  by  genetic  programs  that  became  locked  into  particular  patterns  early  on  in 
mammalian  evolution:  four  appendages;  eyes  above  nose  above  mouth  between  ears;  ten  fingers  and  ten 
toes.  We  are  not  optimised  to  have  just  these  features  and  no  others;  most  of  the  variations  that  we  might 
imagine (nose above eyes; five limbs; tentacles instead of hands) have never been tried by evolution.

patient. And there are risks: the surgical implantation procedure may lead to a higher incidence of
meningitis. Moreover, there are complications: some in the deaf community find cochlear
implants to be ethically misplaced, arguing that the deaf should not be thought of as disabled at all, but
rather as a "minority cultural group."

What of brain parts that are deeper than just the peripheral hearing system? Traumatic brain
injury can cause debilitating deficits in memory and cognition: at present, such injuries are extremely
difficult even to diagnose, let alone to treat. Implants to restore lost cognitive abilities for such accident
victims would be revolutionary, and would be welcomed.

But if implants existed for accident-induced cognitive losses, could they also be used to augment
uninjured cognitive function? There is suggestive evidence from drugs: some Alzheimer's medications
may improve memory in people with mild cognitive impairment, but the FDA has not yet approved the
use of any treatments for these lesser conditions. How would regulators at the FDA react if it
became possible to augment our brains: implants to help us think faster or to increase our memory
capacity? The economic, social, and political concomitants of such technology would surely eclipse those
arising from cochlear implants.

Each brain contains idiosyncrasies; our brains define who we are. The way we interact, the kinds
of decisions we make, the connections we perceive all arise from the still-obscure mechanisms of the
vast span of thalamocortical circuits and cortico-striatal loops in our heads. These repeating components
give us our mammalian abilities, our uniquely human faculties, our individual characteristics. The
computational  understanding  of  individual  and  group  differences  will  likely  lead  to  a  new  science  of 
different types of cognitive behavior, with implications ranging from law to education. The formerly
familiar terrain of human nature may appear quite different in this light; perhaps, arriving there, we will
truly know the place for the first time.

Our  abilities  are  not  inimitable;  brain  circuits  are  circuits,  albeit  nonstandard  ones,  and  they  will 
yield to analysis. As computational neuroscience comes to demystify them, we verge on an era of new
frontiers in science and medicine in which we can increasingly repair, enhance, and likely supplant the
biological  engines  we  think  with. 
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List  of  Acronyms,  Abbreviations,  and  Symbols 


Acronym    Description

BORN       branching object relation notation
bovw       bag of visual words
EM         expectation maximization
CSL        cortico-striatal loop
JLC        joint localization and clustering
Knn        k nearest neighbor
SVM        support vector machine
VTV        vision for time-varying images
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